The world goes 
to Mars 


How to reach another planet when a pandemic 
hobbles yours. 


On 15 July 1965, humanity got its first close-up
look at Mars when NASA's Mariner 4 spacecraft
flew past the red planet, recording
grainy images of a barren, cratered surface.
They were the first glimpse of another planet
as seen from space.

Almost exactly 55 years later, 3 long-awaited Mars 
missions are due to launch (see page 184). Amid a corona- 
virus pandemic and raging geopolitical tensions, the 
missions, from the United States, China and the United 
Arab Emirates (UAE), are a powerful symbol of how nations 
can transcend their Earthly woes as they seek to explore 
and understand other worlds. 

In the decades since Mariner 4, NASA has sent 19 missions
to Mars, 4 of which failed. Today, the agency has three 
active missions orbiting the planet and two robots that 
are carrying out experiments on its surface. The latest US 
mission, Perseverance, which lifts off on 30 July at the ear- 
liest, is meant to push this exploration to the next level. It 
will roll around an ancient river delta in the Jezero Crater, 
searching for signs of past life. More importantly, it will 
drill into Martian rocks and collect rock and dirt samples 
as it travels. The ambition is for a future mission to land 
at Jezero, retrieve these rock samples and return them to 
Earth. If this happens, it would be the first-ever sample 
return from Mars — something researchers can’t wait to 
analyse. 

China’s plan is just as ambitious. Later this month, the 
China National Space Administration intends to launch an 
orbiter, lander and rover combination called Tianwen-1, or 
‘quest for heavenly truth’. Many details have not yet been 
revealed, possibly because of the risk of failure — China 
tried unsuccessfully to send an orbiter to Mars in 2011. 
But it has pulled off several recent impressive accom- 
plishments in space, including a series of Moon missions 
that culminated last year in the first mission to the lunar 
far side. The time may be right for Beijing to succeed in 
reaching Mars. 

And then there is Hope, a Mars orbiter to be launched 
by the six-year-old UAE Space Agency (see page 190) no 
earlier than 15 July. It is the first interplanetary attempt by
any Arab nation. Much of the spacecraft technology has
been developed in collaboration with former NASA mission
engineers hired by the UAE Space Agency. But the science is 
being primarily driven by Emirati researchers: a young and 
vibrant team of explorers. Hope aims to build the biggest, 
most-detailed map of Martian weather produced so far. 

All three missions, which are due to arrive at Mars next 
February, need to launch in the next few weeks while 
Earth and Mars are in the best positions in their orbits for 
a spacecraft to travel between them — an event that hap- 
pens only once every 26 months. It’s remarkable that the 
coronavirus pandemic did not derail their plans. There was 
to have been a fourth Mars mission this summer, but the 
European Space Agency postponed its launch to 2022, in 
part because of the pandemic. NASA had to deploy some 
of its own planes to fly engineers between California and 
Perseverance’s launch site in Florida because commercial 
flights were grounded. Meanwhile, China and the UAE both 
scrambled to finish their missions as COVID-19 raged. 
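The 26-month spacing of launch windows mentioned above follows from the two planets' orbital periods. The sketch below is a minimal illustration, assuming circular orbits and standard approximate orbital periods that are not given in the editorial; the function name is hypothetical.

```python
# Synodic period: how often Earth and Mars return to the same relative
# positions. The orbital periods are standard approximate values and are
# an assumption here; they do not appear in the editorial.

EARTH_PERIOD_DAYS = 365.256   # sidereal orbital period of Earth
MARS_PERIOD_DAYS = 686.980    # sidereal orbital period of Mars
DAYS_PER_MONTH = 365.25 / 12  # average calendar month


def synodic_period_days(inner_days, outer_days):
    """Time between successive alignments of two planets on circular orbits."""
    return 1.0 / (1.0 / inner_days - 1.0 / outer_days)


period = synodic_period_days(EARTH_PERIOD_DAYS, MARS_PERIOD_DAYS)
print(f"{period:.0f} days, about {period / DAYS_PER_MONTH:.1f} months")
```

On these assumptions the alignment recurs roughly every 780 days, a little under 26 months, which is why a missed window means a wait of more than two years.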

Equally remarkable is that the three missions are not 
competing with each other, even though some commen- 
tators are calling the present state of US-China relations 
a new cold war. Whereas the original cold war between
the Soviet Union and the United States dominated both 
nations’ space ambitions in earlier decades, today’s space 
agencies have relatively more-cooperative relationships. 

That said, although NASA and the UAE Space Agency plan 
to make data from their missions publicly available, China’s 
data policy remains unclear. China has been rolling out 
tranches of data from its Moon missions — the third batch 
from its lunar far-side mission was released last month. It 
should join the others, and pledge to share data from its 
Mars mission too. 

Whereas intergovernmental relationships on Earth look 
ever more fraught, researchers must keep trying to tran- 
scend geopolitical squabbles. That includes ensuring that 
international collaboration on these missions continues, 
and that data are quickly made publicly accessible. 

If these three emissaries launch successfully in the 
coming weeks, then we wait. We wait for them to traverse 
hundreds of millions of kilometres through the frigid 
vacuum of space, piloting themselves by the occasional 
command relayed from Earth. Red Mars will appear bigger 
as blue Earth grows smaller. They will arrive early next year 
at an alien, yet strangely familiar, planet. So, too, will we. 


Pulling carbon from 
the sky is necessary, 
but not sufficient 


Carbon dioxide removal is becoming a serious 
proposition — but it is not a substitute for 
aggressive action to cut emissions. 


Could spreading basalt dust on farmers’ fields
help to remove atmospheric carbon? A large 
multidisciplinary team of scientists is confi- 
dent it could, and that doing so could boost 
crop yields and soil health at the same time. 
In this issue, David Beerling, a biogeochemist at the 
University of Sheffield, UK, and his colleagues explore a 
strategy to enhance rock weathering (D. J. Beerling et al. 
Nature 583, 242-248; 2020). 

This is a continuously occurring natural phenomenon
in which carbon dioxide and water react with silicate
rocks on Earth’s surface. In the process, atmospheric
CO₂ is converted into stable bicarbonates that dissolve
and then flow into rivers and oceans. The idea of scaling 
up this process to remove carbon has been considered 
for some three decades. The team’s results provide the 
most detailed analysis yet of the technical and economic 
potential of this approach — and some of the probable 
challenges, including gaining public acceptance. 

The researchers modelled what would happen 
to atmospheric carbon if basalt dust was added to 
agricultural lands in the world’s biggest economies, including
Brazil, China, the European Union, India, Indonesia and
the United States. According to their calculations, doing
so would remove between 0.5 billion and 2 billion tonnes
of CO₂ from the air each year. The upper limit is more than
5 times the annual emissions of the United Kingdom, and 
akin to offsetting emissions from around 500 coal-fired 
power plants. 

The team is also carrying out field trials in four countries,
the only such trials yet. The authors have told Nature that
preliminary results suggest the theory is holding up. The
application of 20 tonnes of basalt dust to a half-hectare
UK plot boosted CO₂ removal by 40% compared with that
seen on an untreated plot, and by 15% in another trial, which
spread dust over oil-palm plantations in Malaysia. The early 
results also indicate that adding basalt boosted yields in 
these and other crops. 

These are encouraging developments at a time when 
governments around the world are struggling to meet 
their climate commitments. The approach, if successful, 
could enable high-emitting countries such as the United 
States and China to remove some of the carbon they have 
pumped into the atmosphere in recent decades. More- 
over, the machines that are required to spread basalt 
dust on fields already exist: farmers use them to treat 
soils with limestone. 


Costing the Earth 


But, like many promising technological fixes, spreading 
basalt dust across the world’s agricultural fields could 
prove more complicated than it first seems. Researchers 
must answer a host of pressing questions about the eco- 
nomic costs and environmental impacts. And there are 
potential questions for regulators, too. 

Tinkering with the geochemical cycle will inevitably alter 
ecosystems in soils, rivers and even oceans. Some of this 
might be beneficial: rock dust of the right variety could 
bolster desirable plant communities, for example. And the 
alkaline content that runs off to the oceans could, in theory,
counteract acidification, helping to protect corals and
other creatures that are threatened by rising atmospheric
CO₂ levels. But we need to be confident that there are no
harmful consequences to land and sea, and any potential 
effects would need to be monitored carefully.

Moreover, mining rock on industrial scales, pulverizing
it and spreading the dust on crop fields will not be cheap.
The current price of carbon on the European Union’s 
emissions trading system is less than €28 (US$31) per 
tonne. By contrast, Beerling and his colleagues estimate 
that enhanced rock weathering will cost between $80 
and $180 per tonne of CO₂. That said, such costs are in
line with competing technologies that could be used to
pull CO₂ out of the atmosphere. And although rock will
need to be mined, the Sheffield team is rightly calling for 
an inventory of free, suitable waste rock from existing 
mining operations. This will bring costs down, increase 
carbon uptake and make more efficient use of mined 
materials. 
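As a rough illustration of the scale these figures imply, the sketch below multiplies the modelled removal range (0.5 billion to 2 billion tonnes of CO₂ per year) by the estimated cost range (US$80 to US$180 per tonne), and compares the per-tonne cost with the quoted carbon price of about US$31. Pairing the low removal figure with the low cost, and the high with the high, is an assumption made here for illustration and is not a calculation from the Beerling paper; the function names are hypothetical.

```python
# Back-of-envelope arithmetic using only figures quoted in the editorial.
# Assumption (not from the paper): low removal is paired with low cost
# and high removal with high cost to bracket the annual bill.

REMOVAL_TONNES_PER_YEAR = (0.5e9, 2.0e9)  # modelled CO2 removal range
COST_USD_PER_TONNE = (80.0, 180.0)        # estimated cost range
EU_CARBON_PRICE_USD = 31.0                # quoted upper bound on the EU price


def annual_cost_range(removal, cost):
    """Return (low, high) annual cost in US dollars."""
    return removal[0] * cost[0], removal[1] * cost[1]


def cost_to_price_ratio(cost, price):
    """Return how many times each cost bound exceeds the carbon price."""
    return cost[0] / price, cost[1] / price


low, high = annual_cost_range(REMOVAL_TONNES_PER_YEAR, COST_USD_PER_TONNE)
ratio_low, ratio_high = cost_to_price_ratio(COST_USD_PER_TONNE, EU_CARBON_PRICE_USD)

print(f"Annual cost: ${low / 1e9:.0f} billion to ${high / 1e9:.0f} billion")
print(f"Cost is {ratio_low:.1f}x to {ratio_high:.1f}x the EU carbon price")
```

On these assumptions the bill would run from about US$40 billion to US$360 billion a year, with per-tonne costs roughly 2.6 to 5.8 times the quoted carbon price.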


Citizen science 


The project team also studied how members of the public 
would react to such technologies (E. Cox et al. Nature Clim. 
Change https://doi.org/10.1038/s41558-020-0823-z; 2020). 
From research carried out in the United Kingdom and the 
United States, it is clear that CO₂-removal strategies could
face scepticism. Respondents who took part in surveys and 
workshop discussions feared that they might take too long 
to develop, and expressed concern that the basalt dust 
could affect ocean ecology. Many also opposed the idea 
of such technologies becoming a substitute for tackling 
the root causes of climate change. 

Concerns surrounding the ecological impacts could 
be allayed with appropriate government oversight. But 
there is no intergovernmental process that is considering
the full suite of issues — including safety and ethics — that 
will need to be addressed if carbon-removal technolo- 
gies are to be applied at significant scales. The Carnegie 
Council for Ethics in International Affairs, a think tank 
in New York City, is working to build awareness among 
governments about the issues they are likely to face if 
these technologies are applied, through the Carnegie 
Climate Governance Initiative. Much of the group’s 
work has been focused on how to regulate technolo- 
gies associated with the ‘geoengineering’ label, such 
as lofting aerosols into the stratosphere to reflect solar 
radiation back into space. Carbon removal, although less 
controversial, is just as important. 

Beerling and his colleagues also deserve credit on this 
front. The University of Sheffield’s Leverhulme Centre 
for Climate Change Mitigation is 4 years into a 10-year, 
£10-million (US$12.5-million) research programme that 
includes modelling and field trials, as well as laboratory 
studies and public-engagement research. But the centre 
cannot be expected to shoulder such a heavy responsibility
alone. Other groups and funders need to step up. 

With the dangers of climate change becoming more 
apparent each year, countries must continue to pursue 
the aggressive action that will be required to meet the goals 
of the 2015 Paris climate agreement. Carbon-removal tech- 
nologies cannot be a substitute for such action. But it is 
becoming clear that if humanity is to limit global warming 
to 1.5-2 °C above pre-industrial levels, it must pursue every
promising idea. 


A personal take on science and society 


World view 


By Pratyusha 
Kalluri 


Don’t ask if AI is good or fair,
ask how it shifts power 


Those who could be exploited by artificial 
intelligence should be shaping its projects. 


Law enforcement, marketers, hospitals and other
bodies apply artificial intelligence (AI) to decide
on matters such as who is profiled as a criminal, 
who is likely to buy what product at what price, 
who gets medical treatment and who gets hired. 
These entities increasingly monitor and predict our 
behaviour, often motivated by power and profits. 

It is not uncommon now for AI experts to ask whether an
AI is ‘fair’ and ‘for good’. But ‘fair’ and ‘good’ are infinitely
spacious words that any AI system can be squeezed into. The
question to pose is a deeper one: how is AI shifting power?

From 12 July, thousands of researchers will meet virtually
at the week-long International Conference on Machine
Learning, one of the largest AI meetings in the world. Many
researchers think that AI is neutral and often beneficial,
marred only by biased data drawn from an unfair society. 
In reality, an indifferent field serves the powerful. 

In my view, those who work in AI need to elevate those 
who have been excluded from shaping it, and doing so 
will require them to restrict relationships with power- 
ful institutions that benefit from monitoring people. 
Researchers should listen to, amplify, cite and collaborate 
with communities that have borne the brunt of surveillance: 
often women, people who are Black, Indigenous, LGBT+, 
poor or disabled. Conferences and research institutions 
should cede prominent time slots, spaces, funding and lead- 
ership roles to members of these communities. In addition, 
discussions of how research shifts power should be required 
and assessed in grant applications and publications. 

A year ago, my colleagues and I created the Radical AI
Network, building on the work of those who came before
us. The group is inspired by Black feminist scholar Angela
Davis’s observation that “radical simply means ‘grasping
things at the root’”, and that the root problem is that power
is distributed unevenly. Our network emphasizes listening
to those who are marginalized and impacted by AI, and
advocating for anti-oppressive technologies. 

Consider an AI that is used to classify images. Experts 
train the system to find patterns in photographs, perhaps 
to identify someone’s gender or actions, or to find a matching
face in a database of people. ‘Data subjects’ (by which
I mean the people who are tracked, often without consent,
as well as those who manually classify photographs to train
the AI system, usually for meagre pay) are often both
exploited and evaluated by the AI system.

Researchers in AI overwhelmingly focus on providing
highly accurate information to decision makers.
Remarkably little research focuses on serving data subjects. 
What’s needed are ways for these people to investigate
AI, to contest it, to influence it or to even dismantle it. For
example, the advocacy group Our Data Bodies is putting 
forward ways to protect personal data when interacting 
with US fair-housing and child-protection services. Such 
work gets little attention. Meanwhile, mainstream research 
is creating systems that are extraordinarily expensive to 
train, further empowering already powerful institutions, 
from Amazon, Google and Facebook to domestic surveil- 
lance and military programmes. 

Many researchers have trouble seeing their intellectual 
work with AI as furthering inequity. Researchers such as me
spend our days working on what are, to us, mathematically
beautiful and useful systems, and hearing of AI success
stories, such as winning Go championships or showing
promise in detecting cancer. It is our responsibility to
recognize our skewed perspective and listen to those
impacted by AI.

Through the lens of power, it’s possible to see why 
accurate, generalizable and efficient AI systems are not
good for everyone. In the hands of exploitative compa- 
nies or oppressive law enforcement, a more accurate 
facial recognition system is harmful. Organizations have 
responded with pledges to design ‘fair’ and ‘transparent’ 
systems, but fair and transparent according to whom? 
These systems sometimes mitigate harm, but are con- 
trolled by powerful institutions with their own agendas. 
At best, they are unreliable; at worst, they masquerade as 
‘ethics-washing’ technologies that still perpetuate inequity. 

Already, some researchers are exposing hidden 
limitations and failures of systems. They braid their 
research findings with advocacy for AI regulation. Their
work includes critiquing inadequate technological ‘fixes’.
Other researchers are explaining to the public how natural
resources, data and human labour are extracted to create AI.

Race-and-technology scholar Ruha Benjamin at 
Princeton University in New Jersey has encouraged us to 
“remember to imagine and craft the worlds you cannot 
live without, just as you dismantle the ones you cannot 
live within”. In this vein, it is time to put marginalized and
impacted communities at the centre of AI research: their
needs, knowledge and dreams should guide development.
This year, for example, my colleagues and I held a workshop
for diverse attendees to share dreams for the AI future we
desire. We described AI that is faithful to the needs of data 
subjects and allows them to opt out freely. 

When the field of AI believes it is neutral, it both fails to
notice biased data and builds systems that sanctify the
status quo and advance the interests of the powerful. What
is needed is a field that exposes and critiques systems that
concentrate power, while co-creating new systems with
impacted communities: AI by and for the people.
The world this week


News in brief


CORONAVIRUS 
TEST FREQUENCY 
MATTERS MORE 
THAN SENSITIVITY 


Communities such as 
universities, where COVID-19 
cases could quickly spiral out 
of control, should test large 
numbers of people frequently 
for the new coronavirus — even 
if that means using a relatively 
insensitive test. 

That’s the conclusion 
of Michael Mina at the 
Harvard T.H. Chan School 
of Public Health in Boston, 
Massachusetts, and his 
colleagues, who modelled the 
effect of widespread testing 
on viral spread in a large
group of people to gauge the 
importance of test sensitivity 
(D. B. Larremore et al. Preprint 
at medRxiv http://doi.org/ 
d2gt; 2020). Tests that rely on 
the technique quantitative 
polymerase chain reaction 
(qPCR) can detect the merest 
traces of SARS-CoV-2 genetic 
material but are expensive and 
slow to return results. 

The researchers found that 
weekly surveillance testing, 
paired with isolation of infected 
people, would limit an outbreak 
even if the testing method was 
less sensitive than qPCR. By 
contrast, surveillance testing 
done every 14 days would allow 
the total number of infections 
to climb almost as high as if 
there were no testing at all. The 
findings have not yet been peer 
reviewed. 


Astronomers 
unveil epic 
X-ray map of 
the Universe 


The newest map of the sky charted in high-energy 
X-rays offers a glimpse of what the Universe would look 
like if seen with X-ray vision. Researchers created the 
image using data from an instrument called eROSITA 
(Extended Roentgen Survey with an Imaging Telescope 
Array), part of the German-Russian satellite mission 
Spectrum-Roentgen-Gamma. 

After sweeping the sky for six months, eROSITA 
has charted more than one million sources of X-ray 
radiation, including gigantic black holes, galactic 
clusters and the remnants of supernova explosions, 
many of which are new to science. 

Researchers hope that a detailed map in this part of 
the spectrum will offer new ways to track the Universe’s 
expansion, and to study the mysterious repulsive force 
called dark energy. The telescope will survey the X-ray 
sky for another 3.5 years. 

“eROSITA has already revolutionized X-ray 
astronomy,” said Kirpal Nandra, a high-energy physicist
at the Max Planck Institute for Extraterrestrial Physics 
in Garching, Germany, where the project team is based. 
“But this is just a taste of what’s to come.” 
News in focus


Fire tore through several rooms used to store plant, animal and human specimens at the Federal University of Minas Gerais in Brazil.


SECOND BRAZILIAN MUSEUM 
FIRE IN TWO YEARS REIGNITES 
CALLS FOR REFORM 


Blaze at a natural history museum in Minas Gerais is forcing some 
researchers to relive the pain of losing priceless specimens and artefacts. 


By Emiliano Rodriguez Mega 


Researchers in Brazil are sifting through
the ashes of a fire that destroyed part
of a museum in the southeastern state
of Minas Gerais on 15 June. The blaze
follows repeated warnings about fire
risks at museums, and comes less than two
years after a massive inferno gutted the prized
National Museum in Rio de Janeiro.

The latest fire has reopened wounds in the
research community and intensified a national
conversation about the need to protect Brazil’s
cultural and scientific heritage.

Mariana Lacerda, a geographer at the 
Federal University of Minas Gerais (UFMG) 
in Belo Horizonte, received a disturbing 
Monday-morning call: a building at the uni- 
versity’s Natural History Museum and Botan- 
ical Garden, which she'd directed for almost 
a year, was in flames. When she arrived on 
the scene, smoke was still coming out of a 
single-storey building that housed thousands 
of artefacts, skeletal remains and taxidermied 
animals, many collected several decades ago. 

Two storage rooms full of fossils and large 
archaeological objects were covered in soot 
and smoke. Flames had partly consumed a 
third room, which housed folk art, Indigenous 
artefacts and biological specimens. Two fur- 
ther rooms, housing important collections 
of insects, shells, birds, mammals, human 
bones and ancient plant remains, were almost 
completely lost. 

For these, “little hope remains of material 
that can be recovered”, Lacerda says. “Something
that is so slow to build was destroyed so
quickly, in just over an hour.”

HISTORY IN FLAMES

Since 2010, at least six sites relevant to Brazil’s
scientific history have caught fire. There has
also been damage to other cultural sites.

2010: Butantan Institute, São Paulo. An inferno
destroyed almost 90% of the museum's
snake bank, the largest in Latin America,
and a small part of its arachnid collection.
Butantan has been able to replace only
about one-third of its snake specimens.

2012: Comandante Ferraz Antarctic Station.
A fire that started in the machine
room housing the power generators
destroyed approximately 70% of the
research station. Two people died.

2013: Museum of Natural Sciences, Belo
Horizonte. The museum, owned by the
Pontifical Catholic University of Minas
Gerais, went up in flames in January.

2015: Museum of the Portuguese Language,
São Paulo. In December, a major fire
destroyed the museum building and
killed one firefighter. The museum has
since been rebuilt and was scheduled
to open its doors in June 2020.

2018: National Museum, Rio de Janeiro.
In September, a blaze at the National
Museum claimed many of the most prized
records of the nation’s past. Recovery and
reconstruction efforts continue.

2020: Natural History Museum and Botanical
Garden, Belo Horizonte. A fire destroyed
parts of the museum owned by the Federal
University of Minas Gerais, affecting stored
zoological and archaeological specimens.
As yet, authorities have no estimate of the
scope of the loss.


Archaeologist André Prous, who started
working at the museum in 1975, was devastated.
He and his colleagues had amassed a
collection of human remains from a range 
of periods, including some from the earliest 
known inhabitants of Brazil, as well as sam- 
ples of cultivated and wild plant species. Prous 
had also seen part of his life’s work disappear 
during the 2018 fire at the National Museum, 
when ancient skulls that he helped to collect 
in the 1970s were destroyed. 

“The sadness is matched only by the fear 
that other, similar disasters will continue to 
destroy [Brazil’s] scientific heritage,” he says. 
Some stone artefacts, ceramics and documen- 
tation of the sites he has excavated survived 
the blaze. 


Historic losses 


Brazilian museums have faced a series of fires,
often resulting in irreparable losses, says
Carolina Vilas Boas, director of museum processes
at the Brazilian Institute of Museums in
Brasilia. At least 12 buildings of cultural or scientific
significance have burnt in the country,
many of them in the past 10 years (see ‘History
in flames’). But the full extent of the damage is 
hard to know, says Vilas Boas, because report- 
ing is probably incomplete. 

Brazil is not unique in losing heritage 
institutions to fire, she says, but the country 
does have a poor record in taking care of its 
museums. Often, fire-prevention systems are 
installed, but budgets are too thin to maintain 
them properly. “There are many actions being 
taken to mitigate this risk,” she says, but recur- 
ring economic crises have hindered long-term 
planning. 

“That lack of resources had no relation to 
the fire in the collection’s storage rooms,” says 
Ricardo Hallal Fakury, a structural engineer 
at the UFMG. He did not speculate as to the 
cause of the fire, because investigations are still 
under way. But he says that the building that 
burnt was equipped with smoke detectors, and 
was mostly built of non-flammable materials. 


Federal pressure


The tragedy in Belo Horizonte has amplified
a decades-long discussion among Brazilian 
scientists pushing for national and state-level 
policies to help protect research collections, 
says Luciane Marinoni, an entomologist at the 
Federal University of Paraná and president
of the Brazilian Society of Zoology, both in 
Curitiba. “The community is upset because 
we have been trying to solve this problem with 
the federal government but without success.”

Some protective policies already exist. In
2017, the southern state of Paraná established
norms and guidelines for the recognition 
of biological collections, defining who has 
responsibility for them, and putting in place 
objectives and goals to expand them and pro- 
vide maintenance. Last year, the policy helped 
researchers to convince the government of 
Paraná to allocate 2 million reais (US$370,000)
for the state’s collections over the next three 
years. It’s not a lot of money, but it’s a solid 
start, says Marinoni: “The collections are 
leaving the darkness.” 

Back in Belo Horizonte, scientists are clean- 
ing up after the fire. This time, however, they 
have some guidance on how to move forward.

National Museum researchers have teamed 
up with Lacerda to advise on the recovery of 
items that might still be salvageable. They are 
sharing protocols they developed after the 
2018 blaze with UFMG professors and students 
who have volunteered to help. “Unfortunately, 
we are now experts in this matter,” says palae- 
ontologist Alexander Kellner, director of the 
National Museum. “We went through it. We 
know the mistakes to avoid, we have a way to 
act, we have a methodology.” 


PHYSICISTS FIND BEST 
EVIDENCE YET FOR 
ELUSIVE 2D STRUCTURES 


Strange quasiparticles called anyons could 
herald a way to build quantum computers. 


By Davide Castelvecchi 


Physicists have reported what could be
the first incontrovertible evidence for 
the existence of unusual particle-like 
objects called anyons, which were first 
proposed more than 40 years ago. 
Anyons are the latest addition to a growing 
family of phenomena called quasiparticles, 
which are not elementary particles, but are 
instead collective excitations of many elec- 
trons in solid devices. Their discovery — made 
using a 2D electronic device — could represent 
the first steps towards making anyons the basis 
of future quantum computers. 

“This does look like a very big deal,” says 
Steven Simon, a theoretical physicist at the 
University of Oxford, UK. The results, which 
have not yet been peer-reviewed, were posted 
on the arXiv preprint server last week¹.

Known quasiparticles display a range of 
exotic behaviours. For example, magnetic 
monopole quasiparticles have only one mag- 
netic pole — unlike all ordinary magnets, which 
always have a north and a south. Another 
example is Majorana quasiparticles, which 
are their own antiparticles. 

Anyons are even more strange. All 
elementary particles fall into one of two 
possible categories — fermions and bosons. 
Anyons are neither. The defining property of 
fermions (which include electrons) is Fermi 
statistics: when two identical fermions switch 
spatial positions, their quantum-mechani- 
cal wave — the wavefunction — is rotated by 
180°. When bosons exchange places, their 
wavefunction doesn’t change. Switching two 
anyons should produce a rotation by some 
intermediate angle. This effect, which is 
called fractional statistics, cannot occur in 3D
space, but only as collective states of electrons 
confined to move in two dimensions. 


Fractional statistics 


Fractional statistics is the defining property of
anyons, and the latest work, led by Michael
Manfra, an experimental physicist at Purdue
University in West Lafayette, Indiana, is the
first time it has been measured so conclusively.

The ‘pyjama stripe’ interference pattern denotes the presence of anyons in an electronic system.

The quasiparticles’ unusual behaviour when 
switching places means that if one particle 
moves in a full circle around another (equivalent
to the two particles switching positions
twice), it will retain a memory of that motion
in its quantum state. That memory is one of 
the telltale signs of fractional statistics that 
experimentalists have been looking for. 
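The exchange behaviour described above can be written compactly. What follows is a standard textbook formulation rather than anything taken from the preprint; the symbols ψ (the two-particle wavefunction) and θ (the exchange angle) are notation introduced here for illustration.

```latex
% Exchanging two identical particles multiplies the wavefunction by a phase
\psi(x_2, x_1) = e^{i\theta}\,\psi(x_1, x_2),
\qquad
\theta =
\begin{cases}
0 & \text{bosons (no change)} \\
\pi & \text{fermions (rotation by } 180^\circ\text{)} \\
\text{intermediate, } 0 < \theta < \pi & \text{anyons (fractional statistics)}
\end{cases}

% One full circle of a particle around another equals two exchanges
\psi \;\longrightarrow\; e^{2i\theta}\,\psi
```

For bosons and fermions the full-circle factor equals 1, so no trace of the loop survives. For anyons it does not equal 1, and that leftover phase is the memory the experiment looks for.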

Manfra and his team manufactured a 
structure consisting of thin layers of gallium 
arsenide and aluminium gallium arsenide. 
This confines electrons to move in two 
dimensions, while shielding them from stray 
electric charges in the rest of the device. The 
researchers then cooled it to 10,000ths of a 
degree above absolute zero and added a strong
magnetic field. This produced a state of matter 
in the device called a fractional quantum Hall 
(FQH) insulator, which has the peculiarity 
that no electric current can run in the inte- 
rior of the 2D device, but can run along the 
edge. FQH insulators can host quasiparticles 
whose electric charge is not a multiple of the 
electron charge, but is instead one-third of it: 
these quasiparticles have long been suspected 
to be anyons. 

To prove that they had indeed detected 
anyons, the researchers etched the device so 
that it could carry currents from one electrode 
to another along two possible edge paths. 
The team tweaked the conditions by varying 
the magnetic field and adding an electric 
field. These tweaks were expected to create 
or destroy anyon states stuck in the interior, 
and also to produce anyons running between 


the electrodes. Because moving anyons had 
two possible paths, each producing a differ- 
ent twist in their quantum-mechanical waves, 
when the anyons reached the end point, their 
wavefunctions produced an interference 
pattern called pyjama stripes. 

This pattern shows how the relative amount 
of rotation between the two paths varies in 
response to changes in the voltage and the 
magnetic-field strength. But the interference 
also displayed jumps, which were the ‘smoking
gun’ for the appearance or disappearance of
anyons in the bulk of the material. 

“As far as I can tell, it is an extremely solid 
observation of anyons — directly observing 
their defining property: that they accumulate 
a fractional phase when one anyon travels 
around another,” Simon says. 

It is not the first time that researchers have 
reported evidence of fractional statistics. 
Robert Willett, a physicist at Nokia Bell Labs 
in Murray Hill, New Jersey, says that his team 
saw “strong evidence” for fractional statistics
in 2013 (ref. 3). 


Quantum computing 

But some theoretical physicists say that the 
evidence in these and other experiments, 
although striking, was not conclusive. “In 
many cases, there are several ways of explain- 
ing an experiment,” says Bernd Rosenow, a 
condensed-matter theorist at the University of 
Leipzig in Germany. But the evidence reported 
by Manfra’s team, if confirmed, is unequivocal, 
Rosenow says. “I’m not aware of an explanation
of this experiment which is plausible and
does not involve fractional statistics.”

The results potentially lay the groundwork 
for applications for anyons. Simon and others 
have developed elaborate theories for how 
anyons could be used as the platform for 
quantum computers. Pairs of the quasi- 
particles could encode information in their 
memory of how they have circled around 
one another. And because the fractional 
statistics is ‘topological’ — it depends on the 
number of times one anyon went around 
another, and not on slight changes to its path
— it is unaffected by tiny perturbations. This 
robustness could make topological quantum 
computers easier to scale up than are current 
quantum-computing technologies, which are
error-prone. 

Topological quantum computing will 
require more-sophisticated anyons than those 
Manfra and colleagues have demonstrated; his
team is now redesigning its device to achieve 
that. Still, anyon applications are some way off, 
researchers warn. “Even with this new result, it 
is very hard to see [fractional quantum-Hall] 
anyons as a strong contender for quantum 
computing,” Simon says. 

But the quasiparticles’ unique physics 
is worth exploring: “To me, as a con- 
densed-matter theorist, they are at least as 
fascinating and exotic as the Higgs particle,” 
says Rosenow. 
News in focus


A scanning electron microscope image of SARS-CoV-2 coronavirus particles (orange) on a cell (blue).


SIX MONTHS OF CORONAVIRUS: 
THE MYSTERIES SCIENTISTS 
ARE STILL RACING TO SOLVE


From immunity to the role of genetics, Nature looks at five
pressing questions about COVID-19 that researchers are tackling.


By Ewen Callaway, Heidi Ledford 
and Smriti Mallapaty 


In late December 2019, reports emerged of
a mysterious pneumonia in Wuhan, China,
a city of 11 million people in the province
of Hubei. The cause, Chinese scientists
quickly determined, was a new coronavirus
distantly related to the SARS virus that had 
emerged in China in 2003, before spreading 
globally and killing nearly 800 people. 

Six months and more than ten million 
confirmed cases later, the COVID-19 pandemic 
has become the worst public-health crisis in a
century. More than 500,000 people have died.
It has also catalysed a research revolution, as 
researchers and doctors have worked at break- 
neck speed to understand COVID-19 and the 
virus that causes it: SARS-CoV-2. 
They have learnt how the virus enters and
hijacks cells, how some people fight it off and
how it eventually kills others. They have identified
drugs that benefit the sickest patients,
and many more potential treatments are in the
works. And researchers have developed nearly 
200 potential vaccines. 

But for every insight into COVID-19, more 
questions emerge, and others linger. That is 
how science works. To mark six months since 
the world first learnt about the disease respon- 
sible for the pandemic, Nature runs through 
some key questions that researchers still don’t 
have answers to. 


Why do people respond 
so differently? 


One of the most striking aspects of COVID-19 
is the stark differences in experiences of the 
disease. Some people never develop symptoms,
whereas others, some apparently healthy, have 
severe or fatal pneumonia. “The differences in 
the clinical outcome are dramatic,” says Kari 
Stefansson, a geneticist and chief executive of 
DeCODE Genetics in Reykjavik, which is look- 
ing for human gene variants that might explain 
some of these differences. 

That search has been hampered by the small 
number of cases in Iceland. But last month, 
a team analysing the genomes of roughly 
4,000 people from Italy and Spain turned 
up the first strong genetic links to severe 
COVID-19 (ref. 1). People who developed 
respiratory failure were more likely to carry 
one of two particular gene variants than were 
people without the disease. 

One variant lies in the region of the genome 
that determines ABO blood type. The other is 
near several genes, including one that encodes
a protein that interacts with the receptor the 
virus uses to enter human cells, and two others 
that encode molecules linked to immune 
response against pathogens. The researchers 
are part of the COVID-19 Host Genetics Initia- 
tive, a global consortium of groups that are 
pooling data to validate findings and uncover 
further genetic links. 

The variants identified so far seem to play 
a modest part in disease outcome. A team led 
by Jean-Laurent Casanova, an immunologist 
at the Rockefeller University in New York City, 
is looking for mutations that have a more 
substantial role. 


What’s the nature of immunity and 
how long does it last? 


Immunologists are working feverishly to 
determine what immunity to SARS-CoV-2 
could look like, and how long it might last. 
Much of the effort has focused on ‘neutralizing 
antibodies’, which bind to viral proteins and 
directly prevent infection. Studies have found (ref. 2)
that levels of neutralizing antibodies against 
SARS-CoV-2 remain high for a few weeks after 
infection, but then typically begin to wane. 

However, these antibodies might linger at 
high levels for longer in people who had par- 
ticularly severe infections. “The more virus, 
the more antibodies, and the longer they will 
last,” says immunologist George Kassiotis of 
the Francis Crick Institute in London. Similar 
patterns were seen with SARS (severe acute 
respiratory syndrome). 

Researchers don't yet know what level of 
neutralizing antibodies is needed to fight off 
reinfection by SARS-CoV-2. And, ultimately, a 
full picture of SARS-CoV-2 immunity is likely 
to extend beyond antibodies. Other immune 
cells called T cells are important for long-term 
immunity, and studies (refs 3, 4) suggest that they are
also being called to arms by SARS-CoV-2. 


Has the virus developed any 
worrying mutations? 


All viruses mutate as they infect people, 
and SARS-CoV-2 is no exception. Molecular 
epidemiologists have used these mutations 
to trace the global spread of the virus. But 
scientists are also looking for changes that 
affect its properties, for instance by making 
some lineages more or less virulent or trans- 
missible. “If it did become more severe, that’s 
something you would want to know about,” 
says David Robertson, a computational biol- 
ogist at the University of Glasgow, UK, whose 
team is cataloguing SARS-CoV-2 mutations. 
Such mutations also have the potential to 
lessen the effectiveness of vaccines, by alter- 
ing the ability of antibodies and T cells to 
recognize the pathogen. 

Horseshoe bats might be the origin of the virus.

But most mutations will have no impact,
and picking out the worrying ones is challenging.
Versions of the coronavirus identified
at the start of outbreaks in hotspots such as
Lombardy in Italy or in Madrid, for instance, 
might look as if they are deadlier than those 
found at later stages or in other locations. But 
such associations are probably spurious, says 
William Hanage, an epidemiologist at Harvard 
University’s T.H. Chan School of Public Health 
in Boston, Massachusetts: health officials are 
more likely to identify severe cases in early, 
uncontrolled stages of an outbreak. Broad 
spread of certain mutations could also be due 
to ‘founder effects’, in which lineages that arise 
early in transmission centres such as Wuhan or
Italy happen to have a mutation that is passed 
on when they seed outbreaks elsewhere. 
Researchers are debating whether the 
widespread prevalence of one mutation in 
the virus’s spike protein is the product of a 
founder effect or an example of a consequential
change to the virus’s biology. The mutation
seems to have first emerged around February 
in Europe, where most circulating viruses 
now carry it, and it is currently found in every 
region of the world. Studies have suggested 
that this mutation makes the SARS-CoV-2 virus 
more infectious to cultured cells, but it is not 
clear how this translates to human infections. 


How well will a vaccine work? 


An effective vaccine might be the only way out
of the pandemic. There are currently roughly 
200 in development. The first large-scale effi- 
cacy trials to find out whether any vaccines 
work are set to begin in the next few months. 
These studies will compare rates of COVID-19 
infection between people who get a vaccine 
and those who receive a placebo. 
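
The comparison such a trial makes can be sketched numerically. The snippet below is an illustration only: the function name and the participant counts are hypothetical, and it assumes the standard definition of vaccine efficacy as one minus the ratio of infection rates in the two arms, which the article does not spell out.

```python
def vaccine_efficacy(cases_vaccine, n_vaccine, cases_placebo, n_placebo):
    """Return efficacy as 1 - (attack rate, vaccine arm) / (attack rate, placebo arm)."""
    if n_vaccine <= 0 or n_placebo <= 0:
        raise ValueError("group sizes must be positive")
    if cases_placebo == 0:
        raise ValueError("efficacy is undefined when the placebo arm has no cases")
    attack_vaccine = cases_vaccine / n_vaccine    # infection rate among the vaccinated
    attack_placebo = cases_placebo / n_placebo    # infection rate among placebo recipients
    return 1.0 - attack_vaccine / attack_placebo


# Hypothetical trial: 10,000 people per arm, 20 infections versus 100.
example = vaccine_efficacy(20, 10_000, 100, 10_000)
print(f"Estimated efficacy: {example:.0%}")
```

An efficacy near zero would mean that infection rates are about the same in both groups; a real trial would also report a confidence interval, which this sketch omits.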

But there are already clues in data from 
animal studies and early-stage human trials, 
mainly testing safety. Multiple teams have 
conducted ‘challenge trials’ in which animals 
given a candidate vaccine are intentionally 
exposed to SARS-CoV-2 to see whether the jab 
can prevent infection. Studies in macaques
suggest that vaccines might prevent lung 
infection and resulting pneumonia, but not 
block infection elsewhere in the body, such 
as the nose. Monkeys that received a vaccine 
developed by the University of Oxford, UK, 
and that were then exposed to the virus had 
levels of viral genetic material in their noses 
comparable to levels in unvaccinated animals (ref. 5).
Results such as this raise the possibility of a 
COVID-19 vaccine that prevents severe disease 
— but not spread of the virus. 

Data in humans, although scant, suggest 
that COVID-19 vaccines prompt our bodies to 
make potent neutralizing antibodies that can 
block the virus from infecting cells. What isn’t 
yet clear is whether levels of these antibodies 
are high enough to stop new infections, or how
long these molecules persist in the body. 


What is the origin of the virus?


Most researchers agree that the SARS-CoV-2 
coronavirus probably originated in bats, 
specifically horseshoe bats. This group hosts 
two coronaviruses closely related to SARS-CoV-2.
One, named RaTG13, was found (ref. 6) in
intermediate horseshoe bats (Rhinolophus 
affinis) in the southwestern Chinese province 
of Yunnan in 2013. Its genome is 96% identical 
to that of SARS-CoV-2. The next-closest match 
is RmYN02, a coronavirus found in Malayan
horseshoe bats (Rhinolophus malayanus), 
which shares 93% of its genetic sequence with 
SARS-CoV-2 (ref. 7).

The 4% difference between the genomes of 
RaTG13 and SARS-CoV-2 represents decades
of evolution. Researchers say this suggests 
that the virus might have passed through an 
intermediate host before spreading to people, 
in the same way that the virus that causes SARS
is thought to have passed from horseshoe bats 
to civets before reaching people. 
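
To give a sense of scale for that 4% gap, here is a rough back-of-the-envelope calculation. It assumes a genome length of about 30,000 nucleotides for SARS-CoV-2, a figure that is not given in the article, and the symbols L and N_diff are introduced only for this illustration.

```latex
% L: approximate genome length (assumed); N_diff: number of differing positions
L \approx 30{,}000 \ \text{nucleotides}, \qquad
N_{\text{diff}} \approx 0.04 \times L = 0.04 \times 30{,}000 = 1{,}200 .
```

On this estimate, the two genomes differ at roughly a thousand positions.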

To unequivocally trace the virus’s journey to 
people, scientists would need to find an animal 
that hosts a version more than 99% similar 
to SARS-CoV-2 — a prospect complicated by 
the fact that the virus has spread so widely 
among people, who have also passed it to other 
animals, such as cats, dogs and farmed mink. 

Zhang Zhigang, an evolutionary microbiol- 
ogist at Yunnan University in Kunming, says 
efforts by research groups in China to isolate 
the virus from livestock and wildlife, including 
civets, have turned up bare. Groups are also 
searching for the coronavirus in tissue samples 
from bats, pangolins and civets. 


India’s epidemic adviser fears 
coronavirus crisis will worsen 


India has confirmed more than 700,000 
cases of COVID-19 so far, making it the 
world’s third-worst-hit country. Major cities 
such as Delhi and Mumbai are particularly 
badly affected, with hospitals struggling
to accommodate critically ill patients. The
current surge in infections follows a two- 
and-a-half-month India-wide lockdown that 
began on 25 March and severely disrupted 
the economy and livelihoods. Jayaprakash 
Muliyil, an epidemiologist at the Christian 
Medical College in Vellore in the state of 
Tamil Nadu, has been advising the Indian 
government on COVID-19 surveillance. He 
talks to Nature about some of the factors 
affecting India’s epidemic. 


Do you think the outbreak in India is 
charting a different path from outbreaks
in other badly hit countries, such as the
United States, Italy or Spain? 

It is. It is spreading much faster here, and
the infection rates are higher. The general
population’s anxiety about the disease is low. 
People will willingly go out into the market 
today, and take fewer precautions to protect 
themselves. Consequently, at least in cities, 
the epidemic is growing very rapidly. And we 
know it is spreading in rural areas, too. The 
whole trajectory of the infection is moving 
upwards more sharply than in many other
countries. What happened in many Western
countries is that when a big city like London 
was affected, other cities reacted strongly and 
reduced transmission. So, everywhere else, 
the doubling time got longer, but in some 
Indian cities it is short. 


India is reporting that its mortality rate
is among the lowest in the world. Is that
accurate? 

The mortality per million people in India is 
expected to be lower because of the low 
average age of India’s population. (Older 
people are more likely to die from this 
infection.) So, we can take some comfort in 
the fact that deaths are fewer, especially in the 
rural population. 

But the problem with death as an indicator 
is that a COVID-19 death has to be certified 
as such. The only way to do this is through 
an RT-PCR test (a reverse-transcription 
polymerase chain reaction test, which looks 
for viral genetic material in nose and throat 
samples). And with a population of 1.3 billion, 
what do you think is the proportion of people 
that has access to this kind of testing? It is 
very low. 

So, it is very difficult to count all deaths due 
to COVID-19. There is no way you can get it 
done, unless rapid tests become more widely 
available. Remember that at least half of all
deaths will happen in the rural villages where
around 66% of our population live. And there
are no real mechanisms to ascertain causes
of deaths in these villages.

Temperature screening in Mumbai, India, during the COVID-19 pandemic.


What do you think of the Indian 
government's response to the epidemic 
so far? 

The lockdown all over the country was not 
the right response. It brought misery to 
untold numbers of people and destroyed 
lives. And we haven't been able to repair 
its consequences for society. That was 
unfortunate. If we had planned the lockdown 
better, we would have still had losses, but 
they wouldn't have been greater than what 
we are experiencing now. 

The lockdown did have one benefit: 
everyone became aware of this thing called 
COVID-19. It is not easy to communicate this 
to everyone in India, with its many remote 
regions, but because of the lockdown, 
people heard about it. The concept of an 
infectious disease is not easy for many to 
understand. In many rural areas, measles 
is considered to be caused by a goddess 
visiting a village. So is chicken pox. There, 
when you introduce the term virus, it 
doesn’t make sense to many groups of 
people. 


What should cities with large outbreaks 
do now? 

Many cities are quarantining people 
returning from COVID-19-affected states
or countries in public facilities and hotels.
I would say that should stop, and these
people should quarantine at home. Most of 
them won't know whether they have been 
infected, because they might not have been 
tested. And when the number of infected 
people is already high in the community, 
quarantining incoming travellers in public 
facilities, which is very labour-intensive, is 
not economical. 

Instead, we should focus on a reverse 
quarantine for elderly people — where the 
old and the vulnerable are quarantined from 
others to protect them. We should also put 
money into hospitals, and provide oxygen for 
patients. That manoeuvre will save lives. 


Interview by Priyanka Pulla 
This interview has been edited for length and 
clarity. 


CANCER RESEARCHER'S
APPOINTMENT PROMPTS
MAJOR ROW


Pier Paolo Pandolfi, who harassed a postdoc while 
at Harvard, will no longer lead an Italian institute. 


By Alison Abbott 


An Italian scientific institute has
reversed its decision to appoint a
high-profile cancer researcher,
Pier Paolo Pandolfi, as its scientific
director after a tumultuous month
of protests and accusations. The row over 
Pandolfi, who admits one instance of sexual 
harassment and has been accused of scientific 
misconduct in his research papers, resulted in 
the resignation of the entire scientific board 
of the Veneto Institute of Molecular Medicine 
(VIMM) in Padua. 

Pandolfi was director of the cancer-genet- 
ics centre at Harvard University’s prestigious 
Beth Israel Deaconess Medical Center (BIDMC) 
in Boston, Massachusetts, until last year, and 
has made discoveries about the molecular 
mechanisms and genetics underlying some 
cancers. Since May, reports that he had been 
accused of sexual harassment have appeared 
online. Meanwhile, over the past seven years, 
anonymous commenters on the website 
PubPeer, which hosts discussion of published 
research results, have raised questions about 
the integrity of some of Pandolfi’s papers. 

Pandolfi admits to the inappropriate 
pursuit of a postdoc at Harvard, for which 
he says Harvard investigated him. He says it 
was an isolated incident and he has received 
counselling, and he resigned from Harvard 
last December. He denies the accusations of 
research misconduct in his work, but says that
he will review any papers under scrutiny. 


Mass resignation 

The foundation that funds VIMM chose 
Pandolfi as the institute’s scientific director 
on 20 May. That decision sparked protests by 
the institute’s entire scientific advisory board, 
which includes two Nobel prizewinners. The 
members resigned en masse on 25 June after 
the appointment was confirmed. However, 
they say that although they were aware of the 
allegations against Pandolfi, their mass resig- 
nation was protesting against the fact that they 
had not been consulted in the appointment 
procedure. Many of VIMM’s principal inves- 
tigators also said that they had not been con- 
sulted, despite the institute’s statutes requiring 
this, and said that the concerns about Pandolfi 
should have been investigated. VIMM’s interim
scientific director, Luca Scorrano, resigned on
22 June over the situation.

Cancer researcher Pier Paolo Pandolfi.

182 | Nature | Vol 583 | 9 July 2020

Giving advice on the appointment of key 
scientific staff is a central role of such boards, 
says Wolfgang Baumeister, chair of VIMM’s 
scientific advisory board, and a director of
the Max Planck Institute of Biochemistry in
Martinsried, Germany. He says it was “totally 
unacceptable” for the foundation to appoint 
Pandolfi without consulting the board. 

“There should have been more investigation 
before making the appointment,” says board 
member Aaron Ciechanover, a biochemist 
at the Technion — Israel Institute of Technol- 
ogy in Haifa who shared the 2004 chemistry 
Nobel prize. 

Under pressure from the scientists and from 
the Italian media, which has reported on the 
allegations, the foundation reversed its decision
on 30 June. In a statement, the foundation's
directors said that rescinding the appointment 
became “necessary, after learning the story 
in which Prof. Pandolfi was involved in Har- 
vard University, of which the Foundation had 
not been informed”. Appointing him would 
compromise “the image and reputation of the
Institute,” they said. 

The foundation’s president and chair of its 
executive board, Francesco Pagano, a urolo- 
gist who made the decision to appoint Pandolfi 
with the board, says that Pandolfi had not told 
him or the foundation about the allegations 
when Pagano made the appointment, and 
that he learnt about them from the press. 
Pagano says that, according to the institute’s 
statutes, the scientific advisory board is only 
a consulting body for scientific questions, and
that the principal investigators were consulted 
appropriately. 


Frequent e-mails 

When it comes to the harassment, Pandolfi 
says that the incident in which he pursued 
the postdoc was isolated. “It was romantic, 
not sexual — and it was the biggest mistake 
in my life,” he says. He says that an internal 
Harvard investigation, which concluded in 
July, referred him to an external service for 
evaluation and coaching. Harvard declined 
to comment on whether it had investigated 
Pandolfi because it said it does not comment 
on personnel issues, but it confirmed that he 
is no longer affiliated with the BIDMC. 

The postdoc, who asked not to be named to 
protect her career, told Nature that starting in
the autumn of 2018, Pandolfi told her he was in
love with her and frequently sent her personal 
e-mails declaring his feelings. He also organ- 
ized “too many one-to-one meetings where 
he talked about his feelings for me”, she says. 
She adds that she told him to keep their rela- 
tionship professional, but to no avail. “It was 
embarrassing, horrible and I was not able to 
work.” The postdoc was transferred to a dif- 
ferent research group after she reported his 
behaviour in April 2019. 

Pandolfi says the evaluation service deemed 
him fit to resume work and oversee his research 
but recommended professional and psycho- 
logical coaching. He says the coaching, which 
he began in September, was useful. 


Image questions 


Pandolfi denies all allegations of misconduct 
in his research papers. Most allegations sug- 
gest that images of molecular assays contain 
duplication or inappropriate altering. On 
29 May, Baumeister asked for advice from 
Enrico Bucci, a science-integrity expert in 
Samone, Italy. Bucci examined 33 papers 
co-authored by Pandolfi. He considers 13 of 
these studies to have serious problems. 
Pandolfi is corresponding author on eight 
of these. 

Pandolfi says that his laboratory is 
extremely careful to maintain and present its 
data correctly. But he says he will look again 
at the papers under scrutiny and make any 
corrections that might be necessary. “I take 
this very seriously,” he says. 
Feature


The US rover Perseverance will land in Mars’s Jezero Crater (circled in yellow). 


ALL ABOARD TO MARS


The United States, China and the United Arab Emirates all plan groundbreaking 
trips to the red planet — a notoriously dangerous destination for space missions. 
By Alexandra Witze, Smriti Mallapaty and Elizabeth Gibney 


Three times in the coming month or
so, rockets will light their engines 
and set course for Mars. A trio of 
nations — the United States, China 
and the United Arab Emirates (UAE) 
— will be sending robotic emissaries 
to the red planet, hoping to start new
chapters of exploration there. 
Each mission is a pioneer in its own right. The 
United States is sending its fifth rover, NASA’s 
most capable ever, in the hope of finding
evidence of past life on Mars and collecting a 
set of rocks that will one day be the first sam- 
ples flown back to Earth. China aims to build 
on its lunar-exploration successes by taking 
one of its rovers to Mars for the first time. And
the UAE will be launching an orbiter — the first 
interplanetary mission by any Arab nation — as 
a test of its young but ambitious space agency.

It is far from a given that all these missions
will make it; Mars is notorious as a graveyard
for failed spacecraft. But if they do, they will 
substantially rewrite scientific understanding 
of the planet. The two rovers are heading for 
parts of Mars that have never been explored, 
and the UAE’s orbiter will track the changing 
Martian atmosphere (see ‘Mars invasion’). 
The teams behind the missions have 
managed to keep their projects on track despite
the coronavirus pandemic that has derailed 
so many other plans, including a European-Russian
Mars mission that has been delayed
by two years. When the three craft lift off in the
next few weeks, they will give residents of Earth 
a chance — however briefly — to look upwards 
and beyond the problems at home. 


NASA’s hunt for rocks 


NASA hopes that its mission to Mars — a six- 
wheeled, three-metre-long rover named 
Perseverance — will be the start of a much 
bigger journey. If all goes to plan, Perseverance 
will extract and store samples of Martian rocks 
that a future mission will one day pick up and 
bring back to Earth, possibly by 2031. It would 
be the first-ever sample return from Mars. 

That means the stakes for Perseverance 
are sky-high. NASA’s four previous Mars 
rovers — 1997’s Sojourner; 2004's Spirit and 
Opportunity; and 2012’s Curiosity — were all 
about exploration. Mission controllers could 
take their time driving those machines, sidling 
them up to interesting-looking rocks or setting
them off across vast plains. But Perseverance 
will arrive at Mars with the focused task of 
identifying and collecting a broad range of 
rocks representing the geological history of 
the area. And it is supposed to fulfil that mission
in one Mars year — nearly two Earth years. 
Whatever the rover picks up will help to shape 
the course of Mars science for decades to come. 

Most significantly, Perseverance represents 
the best chance yet for scientists to learn 
whether life ever arose on the red planet. If it 
collects the right kinds of rock, then scientists 
in laboratories back on Earth might be able to 
tease out signatures of Martian life. 

“This mission gives us the first opportunity 
to take fundamental questions about whether 
there was or wasn’t life on Mars to the next 
level,” says Sherry Cady, an astrobiologist at
the Pacific Northwest National Laboratory 
in Sequim, Washington, who is not directly 
involved with the mission. 

Perseverance will do this with a suite of 
scientific instruments for poking and probing 
the Martian surface and atmosphere. It is a 
familiar-looking rover, basically a copy of the
Curiosity rover that has been exploring Gale 
Crater for the past eight years. NASA's goal was 
to save money by using the same design with 
some tweaks, such as adding a system to store
samples and upgrading the wheels. Despite its 
aim to cut costs, the mission’s price tag has 
risen to US$2.7 billion, nearly $360 million 
over budget, because of problems developing 
some of the instruments. 

The rover carries advanced versions of 
some of Curiosity’s sensors, including a 
chemical analyser that blasts rocks with a 
laser to identify the atoms and molecules 
they are made of, and a sharp-eyed cam- 
era system that can zoom in on areas of 
interest to produce stereo and 3D pictures. 
Perseverance also sports an experiment that
will try to produce oxygen from Mars’s carbon
dioxide-rich atmosphere, as a test of ways to 
support future human explorers. The rover 
has X-ray and ultraviolet spectrometers for 
analysing mineralogy in detail — and, for a bit 
more novelty, microphones for listening to 
Martian sounds, plus a squat, solar-powered 
helicopter. 

Then there is the sampling system, 
which engineers designed from scratch. 
Perseverance carries 43 tubes in its belly. When 
it encounters a rock that mission scientists 
want to sample, the rover will reach out its 
2.1-metre-long robotic arm and drill a sample 
about the size of a penlight: 60 millimetres 
long and 13 millimetres across. The sample 
goes into a tube and is sealed. Eventually, once
Perseverance has filled at least 20 of its tubes, 
it will cache them on the surface of Mars until 
some future, yet-to-be-funded robot arrives 
to retrieve them. NASA currently plans to 
work with the European Space Agency (ESA) 
to launch a mission in 2026 that would return 
the rocks to Earth in 2031. 

Perseverance will land in the 
45-kilometre-wide Jezero Crater, just north 
of the Martian equator and in a spot that was
once home to a lake and a river delta. That 
ancient delta offers a rich variety of geological
landscapes, where Perseverance could collect
many samples that might contain signs of past 
life, says Kennda Lynch, a planetary scientist at 
the Lunar and Planetary Institute in Houston, 
Texas, who has studied the Jezero landing site. 
Engineers at the Jet Propulsion Laboratory in 
Pasadena, California, which built the craft, have 
already mapped out several routes that it could 
take around the delta, covering on the order of
15 kilometres. Cady says the rover will do best 
if it first rolls around the region to survey the 
landscape, then returns to collect samples. 

Perseverance is scheduled to launch from 
Cape Canaveral Air Force Station in Florida 
between 30 July and 15 August, and land on 
Mars on 18 February 2021. 


China’s Mars debut 


China has ambitious plans for its first 
exploration of Mars. An orbiter, lander and 
rover packed with 13 scientific instruments are 
set to launch from an island in southern China 
in late July. The mission, named Tianwen-1, 
which means ‘quest for heavenly truth’, will
be China’s deepest probe into space. When it 
arrives, in February next year, the mission will 
aim to conduct a comprehensive survey of the
planet’s atmosphere, internal structures and 
surface environment, including searching for
the presence of water and signs of life. 

A previous attempt by China to send an 
orbiter to Mars, aboard a Russian spacecraft 
in 2011, ended with the probes going missing. 
But after that loss, China has racked up a string
of wins in space. In 2013, it became the third 
country to land a spacecraft on the Moon. And
last year, a Chinese lander touched down on 
the Moon’s far side — the first one from any 
country to do so. In May, China successfully 
test-launched a spaceship that will shuttle 
crew to the country’s new space station — 
expected to be finished in 2022. 

But the Mars project is in a different league 
from China’s previous space missions, 
researchers at the China Academy of Space 
Technology in Beijing said in a 2017 paper 
(P. J. Ye et al. Sci. China Technol. Sci. 60, 
649-657; 2017). The voyage to Mars is 1,000
times longer than that to the Moon, and the
planet has twice the Moon’s surface gravity, has
an atmosphere and is littered with dense rock,
all of which makes the effort much riskier.

The Chinese government has been tight- 
lipped about the mission: most public 
information has come from published 
articles and state-media reports, which omit 
key details about its budget, the exact launch 
date and where the probe will land on the 
planet. Scientists involved with the mission 
have declined interview requests until after 
the launch. But Wang Chi, a space physicist and
director-general of the National Space Science
Center (NSSC) in Beijing, said in an e-mail that
the mission is moving forward as planned. “Our
team is working in the Wenchang launch centre
right now, and everything goes smoothly,” he
said, referring to the facility on Hainan Island.
Wang is responsible for the scientific payloads
involved in the mission, which is being led by 
the China National Space Administration. 

If everything goes as planned, Tianwen-1 will
be the first mission to successfully study the 
red planet with an orbiter, lander and rover. 
Once the combined craft reaches Mars, the 
hexagonal orbiter will release the lander and 
rover — protected by a spherical cone — into 
the Martian atmosphere. The Chinese team has 
identified two potential landing areas north of 
the equator on the plains of Utopia Planitia, 
according to a presentation made by Wei Yan 
of the National Astronomical Observatories in 
Beijing, who spoke at the European Planetary 
Science Congress in Geneva, Switzerland, last 
September. 

Three spacecraft heading to the red
planet this year will send
back an unprecedented
stream of information
about the alien world.


Never before will such a diverse array of
scientific gear have arrived at a foreign
planet at the same time, and with such
broad ambitions. Missions from China, the
United States and the United Arab Emirates
(UAE) will include two orbiters, two rovers,
a stationary surface laboratory and even a
helicopter. They aim to study everything
from Mars’s buried water deposits to the
top of its atmosphere, with a particular
focus on the search for life.


Landing sites


A US rover called Perseverance will land in Jezero Crater,
near a delta formed by an ancient river — a prime location
for finding signs of past life if it existed. China is considering
several landing sites for its Tianwen-1 mission.


HOPE


The UAE mission will travel around
Mars in an elliptical orbit ranging from
about 22,000 to 44,000 kilometres. It
carries two spectrometers and a
high-resolution imager to capture
information about how the atmosphere
changes over the day and throughout
the seasons.

The US rover is a car-sized vehicle
packed with seven instruments. Its main
task is to collect rock samples destined
to be carried back to Earth in a future
mission. It will also study the planet’s
weather and geology, hunt for water,
produce oxygen from carbon dioxide,
record sounds for the first time and test
a solar-powered helicopter.


The escaping atmosphere


Mars once had a thick atmosphere and
a significant amount of liquid water on
the surface, but much of the
atmosphere has leaked away over
billions of years. Hope will assess how
oxygen and hydrogen atoms and ions
are escaping into space.

China’s pioneering mission to Mars will
carry an orbiter, rover and lander — it
would be the first nation to achieve all
three. Both the rover and orbiter have
radar instruments for spotting water
and ice on the surface and
underground. They will also study the
planet’s geology and weather.


The robot geologist


Perseverance carries 43 tubes to hold rock samples
collected and stored by a series of 3 robots. When
the samples are eventually returned to Earth, they
could provide the first definitive evidence of whether
life once existed on Mars.

1 The 2.1-metre-long robot arm drills
a thin sample of rock.

2 The arm delivers the
sample to a carousel,
which moves the sample

3 A second robotic arm
carries the sample to
different instruments for
initial measurements, then
seals the tube.

4 A future mission will aim to
retrieve the cached samples
and send them back to Earth.

[Diagram labels: robotic arm with drill; sealing, volume and vision stations; sample; orbiter]


Search for life


Perseverance has several
instruments that will hunt for
evidence of life. One of those is
SHERLOC, which illuminates rocks
with an ultraviolet laser and
records spectra of the
luminescence and reflectance. It
can identify the signal of organic
molecules and minerals that
formed in watery environments.


The probe will parachute and then hover
to the ground, settling on the circular
lander’s four legs. The rover, weighing some
200 kilograms, will then extend its solar panels,
drive down a ramp and begin to autonomously
explore its surroundings for the rest
of its lifetime of around 90 Martian days, each
of which lasts 24 hours and 37 minutes. During
the rover’s mission, the orbiter will act as a
communication link, and then will move into
closer orbit to survey the planet for an entire
Martian year.
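
As a rough illustration of the rover lifetime quoted above, the following sketch converts Martian days into Earth days using only the figures in the text (about 90 Martian days of 24 hours and 37 minutes each). The function and constant names are illustrative, not part of any mission software.

```python
# Length of one Martian day as quoted in the article: 24 hours 37 minutes.
MARTIAN_DAY_MINUTES = 24 * 60 + 37
EARTH_DAY_MINUTES = 24 * 60


def martian_days_to_earth_days(martian_days: float) -> float:
    """Convert a duration in Martian days to Earth days."""
    return martian_days * MARTIAN_DAY_MINUTES / EARTH_DAY_MINUTES


# The article gives the rover's planned lifetime as around 90 Martian days.
rover_lifetime = martian_days_to_earth_days(90)
print(f"90 Martian days is about {rover_lifetime:.1f} Earth days")
```

On these figures, the planned rover lifetime works out to a little over 92 Earth days.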

The Chinese team has fitted the orbiter with 
eight instruments, and the rover with five. 
The subsurface radar on the orbiter can peer 
100 metres deep to map geological structures 
and search for water and ice. Medium- and 
high-resolution cameras will collect images of 
features such as dunes, glaciers and volcanoes,
providing clues to how they formed. Both the 
orbiter and rover will carry spectrometers 
to study the composition of soil and rocks, 
looking especially for evidence of how water 
has altered geological features. The team 
also plans to collect atmospheric data on 
temperature, air pressure, wind speed and 
direction, as well as study the magnetic and 
gravitational fields on Mars. 

Similar instruments have been sent on 
previous missions to Mars, says Raymond 
Arvidson, a planetary geologist at Washington 
University in St Louis, Missouri. But Mars 
is big, and has a complicated geological 
history, so the data generated from Tianwen-1 
could inform researchers’ understanding 
about locations not covered by existing 
observations, he says. “If the Chinese instru- 
ments work, produce data, and the data are 
shared in a manner similar to what we do, it 
will all be worth the effort,” says Arvidson, 
referring to a free public archive of geoscience
data collected from many previous planetary 
explorations that is managed by his university 
and NASA. 

Dmitrij Titov, project scientist for ESA’s 
Mars Express orbiter, which launched in 2003, 
says the Chinese orbiter could outlast some 
of the veterans that might be nearing the end 
of their life, including Mars Express, NASA’s 
Mars Reconnaissance Orbiter and the Mars 
Atmosphere and Volatile Evolution orbiter, 
known as MAVEN. Continuous monitoring of 
the planet will benefit the community at a time
when many other space agencies will be busy 
building sample-return missions, says Titov. 
In fact, China has its own plans to collect and 
bring back samples from Mars by 2030. 


UAE’s interplanetary hope 

The United Arab Emirates had big dreams 
when it decided to shoot for Mars with its 
first probe to go beyond Earth orbit. So it 
chose the name ‘Hope’ for the orbiter, which 
is set to blast off from the Tanegashima Space 
Centre in Kagoshima, Japan, during a three- 
week window starting on 15 July. 

Perseverance will explore an ancient river delta where water once flowed on Mars’s surface.

If successful, the Emirates Mars Mission
(EMM) will not only mark the first interplanetary
venture of any Arab nation, but also
produce the first global weather map of Mars.
Although previous probes built up a picture
of the planet’s atmosphere from orbits that
allowed them to monitor each part of the
planet at limited times of day, Hope’s huge
elliptical orbit will enable the orbiter to
observe big chunks of Mars under both day- 
and night-time conditions, covering almost 
the entire planet in each 55-hour orbit. “We'll 
be able to cover all of Mars, through all times 
of day, through an entire Martian year,” says
Sarah Al Amiri, science lead for the project and 
the country’s minister for advanced sciences. 
The probe’s visible-light camera and infrared 
spectrometer will study Martian clouds and 
dust storms in the lower atmosphere. Its ultra- 
violet spectrometer will monitor gases in the 
upper atmosphere. “This is the first mission 
that will give a global picture of the dynam- 
ics of the Mars atmosphere,” says Hessa Al 
Matroushi, a member of the EMM science team.
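
To give a sense of scale for the coverage described above, this sketch estimates how many 55-hour orbits fit into one Martian year. The orbital period comes from the text; the length of a Martian year (roughly 687 Earth days) is an outside figure assumed here, and the names are illustrative only.

```python
ORBIT_HOURS = 55  # Hope's orbital period, as given in the article
# Assumption (not from the article): a Martian year is roughly 687 Earth days.
MARTIAN_YEAR_EARTH_DAYS = 687


def orbits_per_martian_year(orbit_hours: float = ORBIT_HOURS,
                            year_days: float = MARTIAN_YEAR_EARTH_DAYS) -> float:
    """Estimate the number of complete-planet passes in one Martian year."""
    return year_days * 24 / orbit_hours


print(f"About {orbits_per_martian_year():.0f} orbits per Martian year")
```

Under these assumptions, the orbiter would sweep over almost the whole planet roughly 300 times in a Martian year.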

During its two-year mission, Hope will track 
daily weather variations and changing seasons. 
As well as helping to prepare for future human
missions, it should reveal how atmospheric 
conditions cause hydrogen and oxygen to 
escape into space. This could help scientists 
to understand Mars’s climate and how it lost 
its once-thick atmosphere. The team worked 
with international collaborators to come up 
with its science goals, and the data will be 
made available to the international commu- 
nity with no embargo period, says Al Amiri. 
“The Emiratis were very keen to make this not 
just a technology demonstrator, but make it 
contribute to the scientific understanding 
of Mars,” says Richard Zurek, who is the chief 
scientist for the Mars Program Office at NASA's 
Jet Propulsion Laboratory. 

An interplanetary spacecraft was a 
significant leap in capability for the UAE, which 
hired seasoned engineers from previous NASA
missions, mainly at the University of Colorado 
Boulder. The partnership has an explicit goal 
of transferring know-how to the team at the 
Mohammed Bin Rashid Space Centre, with 
whom the engineers worked on each element 
of the mission. “The reality is we are a young 
country and we couldn't do anything that 
we did without partners and international 
collaboration,” says Ahmad Belhoul, minister 
for higher education and chair of the UAE Space 
Agency. 

Unusually for an interplanetary project, the
idea for the mission came not from scientists 
but from the government itself — and with 
a non-negotiable deadline of 2 December 
2021, the country’s 50th anniversary. Picking 
such an audacious task was designed not 
only to inspire young people in the region 
but also to kick-start the UAE’s move to a 
knowledge-based economy, says Omran 
Sharaf, project director for the EMM. 

And the mission is already having an impact, 
with universities offering five new undergrad- 
uate courses in pure sciences and growing 
enthusiasm for space among Emirati children. 

In many ways, even if Hope blows up on the
launch pad, the mission would be a success,
says Al Amiri, who quickly reconsidered
that point. “My heart just skipped a beat just
thinking about it.”


Alexandra Witze writes for Nature from 
Boulder, Colorado. Smriti Mallapaty is a senior 
reporter in Sydney, Australia, and Elizabeth 
Gibney is a senior reporter in London. 
A woman sells face masks in Mexico City.


Women are most affected by pandemics 
— lessons from past outbreaks 


Clare Wenham, Julia Smith, Sara E. Davies, Huiyun Feng, Karen A. Grépin, 
Sophie Harman, Asha Herten-Crabb & Rosemary Morgan 


The social and economic 
impacts of COVID-19 fall 
harder on women than on 
men. Governments need to 
gather data and target policy 
to keep all citizens equally 
safe, sheltered and secure. 
Women are affected more than
men by the social and economic
effects of infectious-disease
outbreaks. They bear the brunt
of care responsibilities as schools
close and family members fall ill. They are
at greater risk of domestic violence and are
disproportionately disadvantaged by reduced
access to sexual- and reproductive-health
services. Because women are more likely than
men to have fewer hours of employed work
and be on insecure or zero-hour contracts,
they are more affected by job losses in times
of economic instability.

There has been a “horrifying global surge 
in domestic violence” since the start of the 
COVID-19 lockdowns, said United Nations sec- 
retary-general Antonio Guterres in early April. 
Malaysia, for example, reported 57% more 
calls to domestic-abuse helplines between 
18 March and 26 March. Moreover, sexual- 
and reproductive-health clinics are closing 
worldwide. Some US states have restricted 
access to abortions.

It is all too familiar. During outbreaks of 
Ebola and Zika viruses in the past few years, 
women’s socio-economic security was
upended, and for longer than men’s. During
the West African Ebola outbreak of 2014-16,
for example, quarantines closed markets for
food and other items. This destroyed the livelihoods
of traders in Sierra Leone and Liberia,
85% of whom were women. Men lost jobs, too,
but 63% had returned to work 13 months after 
the first case was detected. For women, the 
proportion was 17% (ref. 2). 

At the same time, too little is known about 
the differential impacts of outbreaks on men 
and women. And that can leave political and 
policy responses flying blind. Only a minor- 
ity of governments collect and share basic, 
disaggregated sex and gender data on cases 
of infectious disease and the socio-economic 
impacts of the response to outbreaks. Analy- 
sis remains high level, often conducted after 
the fact and with incomplete information 
(go.nature.com/2a9gtja). This time, gaps must 
be plugged. 

Here, we call for COVID-19 research, 
response and recovery efforts that are tailored 
to support women (see ‘How to minimize the 
gendered impact of COVID-19’). The three pri- 
orities are to tackle domestic violence; ensure 
access to sexual- and reproductive-health 
services; and support women’s livelihoods. 

We recognize that gender is neither binary 
nor fixed; that the pandemic differentially 
affects non-binary and transgender people 
(go.nature.com/2zym8jc); and that gender in 
global health intersects with other social stratifiers
such as ethnicity and race, religion, location,
disability and class. Therefore, beyond
what we set out here, efforts to reduce the 
differential effects of COVID-19 must explore 
these intersections of marginalization and 
vulnerability. 


Domestic violence 


Domestic abuse has increased around the
world since social isolation and lockdown
measures for COVID-19 began, affecting
women and girls more than men. In March,
the media reported that a woman was killed at
the hands of a partner every 29 hours in Argentina
— that’s around 4 more women than the
monthly average (go.nature.com/3evkopw).
The official statistics are yet to be reported,
and are often unreliable because reports tend
to omit the victim-perpetrator relationship
and motive. With isolation measures restricting
the movement of women and their privacy,
many will be struggling to access help. Cases
of domestic abuse are likely to increase as
COVID-19 continues and data are collected.
Similar patterns emerged in previous health
crises. During the 2014 Ebola outbreak in
Guinea, sexual and gender-based violence rose
by 4.5% compared with pre-outbreak levels,
according to the country’s minister of social
action, women and children. Last year, a study
in Ebola-affected regions of the Democratic
Republic of the Congo (DRC) showed that
women and girls reported increases in sexual
and domestic violence after the outbreak
started in 2018 (go.nature.com/3duubsx).
Countries’ efforts on the issue in the current
pandemic vary widely. In some, it has not been
addressed at all — in Kazakhstan, for example,
where domestic violence is not a criminal
offence. And Hungary declared in May that
it would not ratify the Istanbul Convention
targeting violence against women, leaving
women without protection from domestic
abusers (go.nature.com/Z3ewmmpg).


How to minimize the
gendered impact of COVID-19


Steps must be taken at three stages on
domestic violence, sexual and reproductive
health, and jobs.


Before. States must learn from problems and
solutions during previous outbreaks, and
from the first wave of COVID-19. In May, the
World Health Organization issued a briefing
document considering the gendered effects
of COVID-19 (go.nature.com/3hubc4k).
It must follow up with guidelines for best
practice.

Such guidance should be integrated
into domestic preparedness strategies,
detailing which budget lines and indicators
to track in national data sets, such as
disaggregated case rates, morbidity,
mortality, unemployment, crime and so
on. For example, in the province of Quebec,
Canada, where women make up the majority
of care workers and care-home residents,
54% of people who have died from COVID-19
are women. This differs from global
statistics, which show more deaths in men.
In Kenya, a survey found that more women
than men reported a complete loss of
income or employment. Can other nations
adapt their policies accordingly?


During. Policymakers must accept that
outbreaks affect groups differently.
Governments must collect intersectional
gender-disaggregated data across every
aspect of the national response, from
incidence and death rates, social protection
and employment schemes, to accessing
non-pandemic-related health services. Rapid
multidisciplinary research on the gendered
impact of the virus must be funded and
fast-tracked into policy and strategy, and must
be supported during the recovery phases.
Governments must fund organizations
supporting and studying those at risk from
domestic violence and survivors. Sexual
and reproductive health must be prioritized,
protected and studied. Government
policies to support livelihoods should be
unconditional and broad-based, sensitive to
the different impacts on men and women,
and iterate as information and the situation
change.


After. Gender must be central to lessons
learnt for recovery and future pandemic
preparedness. Transition planning
must appreciate the wider impacts on
domestic abuse, livelihoods and sexual
and reproductive health. For example,
governments should consider how staged
return-to-work policies make women or
men more vulnerable to a second wave
of infection, and how the rapid lifting of
lockdown measures might see a surge in
demand from women seeking help over
domestic abuse. Any long-term recovery
must consider the potential consequences
of the depression on the more limited
employment opportunities for women, the
lower value put on their labour and their
economic autonomy.

By contrast, other nations braced for the 
onslaught. Italy increased the number of 
domestic-abuse helplines and set up clan- 
destine notification protocols at pharmacies 
(go.nature.com/2vfxj5f). Australia boosted 
funding for anti-violence organizations, 
including those that offer safe accommoda- 
tion. Kenya bolstered telephone counselling 
services for those facing domestic violence 
or the threat of it (go.nature.com/3dbvubn). 

Messages against domestic violence hang outside an apartment block in Lebanon.

To identify where such interventions can
prevent most harm, there is an urgent need to
collect data using a variety of methods. This 
should be done during and after the outbreak, 
and should focus on what causes violence and 
where. Because domestic violence is widely 
under-reported, innovative methods are 
required. 

Data gathering poses many challenges, 
particularly during a crisis such as COVID-19. 
Governments and researchers must work with 
survivor organizations to understand trends 
and impacts, changes in contexts and the 
socio-political dynamics. For example, how 
are levels of violence changing in response 
to lockdown or unemployment? To capture 
the stories of women affected by violence, 
whose experiences might not be apparent in 
official statistics, researchers will need to use 
qualitative methods such as interviews with 
community leaders, health-care providers 
and the women themselves. Examples of best
practices should be identified and shared to 
inform future responses to outbreaks. 


Sexual and reproductive health 


Global health emergencies limit and disrupt 
sexual- and reproductive-health services; 
COVID-19 is no different. This dangerous
curtailment of women’s rights and well-being 
slows progress towards achieving the UN 
Sustainable Development Goal on gender 
equality. Yet, as of 9 June, the World Health 
Organization’s COVID-19 Strategic Prepar- 
edness and Response Plan had provided no 
recommendation on how resources should 
be channelled to provide safe abortion and 
ensure the supply of contraceptives. 

With governments left to chart their own 
paths, the consequences have been grim. Con- 
traceptives are still out of stock in Indonesia, 
Mozambique and many other countries. Abor- 
tions in Italy were cancelled, and are still not 
happening in some hospitals. Coupled with the 
increase in sexual violence and domestic abuse 
that happens in outbreaks, these problems 
reduce the autonomy and self-determination 
of women and girls, and can damage their 
health and well-being. 

After the Ebola outbreak in Sierra Leone 
in 2014, some studies estimated that teen- 
age pregnancies were 23% higher than in the 
previous year. Restrictions on abortions do
not necessarily limit demand. Driven underground,
these services become unsafe. During
the 2016 outbreak of Zika — a virus that affects
fetal development, manifesting in babies
with abnormally small heads, or microcephaly
— no national policy changed to increase
access to reproductive-health services. As a
consequence, women in the Zika epicentre — in 
Brazil, Colombia and El Salvador — told us, in 
work currently under peer review, that they 
sought unsafe abortions through providers 
they found online, feminist groups and the 
black market. Because abortion is illegal in 
most states where Zika was prevalent, there 
are no official statistics on it. 

Government policies on abortion during the 
current pandemic differ widely, and will lead to
different outcomes for women. For example, 
England changed its legislation in March to 
permit medical abortion at home through the 
use of pills (mifepristone and misoprostol) to 
terminate pregnancy after online consulta- 
tion with a physician. Conversely, the states 
of Texas, Ohio, Iowa, Oklahoma and Alabama
have further restricted access to abortion,
deeming it a non-essential service.

The family-planning organization Marie 
Stopes International estimates that there 
could be up to 2.7 million extra unsafe 
abortions performed as a consequence of
COVID-19. 

In the short term, policymakers should take
three urgent steps. First, they should make 
contraceptives freely available at pharmacies. 
Second, they should permit medical abortions 
at home, in consultation online with a health 
professional. Third, policymakers should 
develop a minimum initial service package for 
sexual and reproductive health to be imple- 
mented at the start of every humanitarian 
crisis. It should ensure access to contracep- 
tion, obstetric and newborn care, and safe 
abortion care. 

The increases in sexual and domestic violence
during the DRC Ebola outbreak reveal the
difficulties of prioritizing sexual and repro- 
ductive health during emergencies, when 
health-care systems are already strained. The 
Inter-Agency Working Group on Reproductive 
Health in Crises in New York City details what 
governments and donor organizations should 
provide to women and girls to meet reproduc- 
tive-health needs. For example, women are 
more likely to use services in locations that are 
less risk-prone, such as in community centres, 
rather than in hospitals, which are often seen 
as disease hotspots. 

In the longer term, researchers should 
consider the effects of reduced access to sex- 
ual- and reproductive-health services dur- 
ing the pandemic. Comparing how women 
engage with services during a crisis and 
normal periods can help to analyse fertility 
rates or barriers to health care. For example, 
women changed their reproductive decisions 
because of the risks posed by Congenital Zika 
Syndrome, but this was not uniform across 
society. Fertility declined more in higher 
socio-economic groups than in low-income 
groups. Such insights allow governments to
target programmes to where they are most 
needed. 


Livelihoods 


COVID-19 is decimating livelihoods across 
the world. The Organisation for Economic 
Co-operation and Development, the African 
Union and the International Monetary Fund all
predict potentially frightening consequences 
for national, regional and global economies. 
By 27 March, 84 countries had adopted fiscal 
measures to mitigate the economic effect on 
households. By 12 June, the number had risen
to 195. Most governments increased either the 
coverage or payout amounts from existing 
social-protection schemes. Forty-seven coun- 
tries have made cash-transfer programmes 
more flexible by waiving conditions such as 
the requirement for children to attend school 
and for women to attend ante- and postnatal 
appointments (such as in the Philippines). 
Some, such as Armenia, have provided home 


delivery of payments for elderly people. And 
64 governments have amended unemploy- 
ment benefits; 49 have adopted paid sick-leave 
interventions.

So far, only 16 countries have reported new 
or amended social-protection measures that 
make reference to women. Pakistan, for exam- 
ple, has increased cash transfers to women who 
are already receiving financial assistance from 
the state. Algeria has introduced paid leave for 
women who are pregnant, have chronic dis- 
eases or are taking care of children. Togo is pro- 
viding women with US$21 per month, whereas 
men receive $17: President Faure Gnassingbé 
specified in April that this was because women 
are “more directly involved in nurturing the 
entire household”. Canada has increased its 
national childcare benefit, which is directed 
to mothers unless otherwise requested. These
policies recognize the specific and increased
burden that COVID-19 is having on women 
because of social expectations around caring 
responsibilities. 

Yet, most countries’ interventions overlook 
the fact that the economic consequences are 
likely to be worse for women. 

Measures do not sufficiently cover workers 
in the gig or informal economy, such as street 
vendors or those on zero-hour contracts. 
They are at particular risk, because they lack 
the social protections of those who are for- 
mally employed. In particular, in low- and 
middle-income countries, 92% of women 
and 87% of men work in the informal economy.
Although the difference between
these proportions is small, women tend to
work in positions that leave them more open
to exploitation and abuse, such as in domestic
work, home-based work or by contributing to
family businesses. High-income countries
are not immune to these trends: data from 
the European Institute for Gender Equality 
suggest that 26.5% of women employees 
in the European Union work in precarious 
employment, compared with 15.1% of men 
(go.nature.com/3eaabbt). 

Women have higher representation in the 
sectors that are now laying off employees, such 
as hospitality, travel, education and retail (see, 
for example, go.nature.com/2zalzme). Many 
women have had to stop any casual work to 
meet care duties during lockdown. 

Much broader measures are urgently 
needed for these workers and their families. 
By 22 May, just 94 of the 190 countries or 
regions for which information was availa- 
ble had reported commitments to support 
informal workers financially, leaving millions
at risk. Spain has committed to a universal
basic income that will protect all workers. By
contrast, Hong Kong gives universal payments
only to permanent residents. This will not
cover the 5% of the city’s population who are
migrant domestic workers — mainly women.
Australia’s JobKeeper programme pays wage 
subsidies to salaried employees during the 
pandemic, but not to casual workers — who are 
more often women (go.nature.com/2zalzme). 

In the short term, governments should focus
on help for informal and casual workers. For
example, they could remove requirements that a
person must have had previous taxable income
to benefit from COVID-19-related relief, and
ensure that unemployment benefits and
statutory sick pay meet basic needs.

This is also a time for innovation. New 
Zealand, for example, is suggesting a four-day 
working week to mitigate rising unemploy- 
ment, to support a better work-life balance 
and to boost local tourism. The idea comes 
from a well-being budget that it introduced 
last year (go.nature.com/2bjtiqa). 

To inform the long road out of this global 
depression, we need to monitor the real- 
world impact of policies on the hardest hit in 
real time, so that strategies can be adjusted 
if necessary. Such research requires sex-dis- 
aggregated data on the workforce. The UK 
government, for example, suspended col- 
lection of data on the gender pay gap during 
the pandemic because it was deemed non-es- 
sential. Such information is more crucial now 
than ever. 


Context is key 
Broad-brush comparisons of vulnerabilities 
to COVID-19 responses are to be treated with 
caution. Gender and its impacts are con- 
text-specific, and vary between and within 
countries. The data collected in other health 
emergencies in Liberia, Yemen or Brazil can 
suggest trends. But data sets are often incom- 
plete, and the nuances are highly dependent 
on race, religion, ethnicity, location, disability
and class. Addressing some of the issues that
women face in outbreaks highlights a broader 
landscape of inequalities. Policymakers must 
consider and support all those at the margins.
Our critics might advocate for other 
priorities. We're calling on governments to use 
evidence to ensure that all their citizens have 
an equal chance of safety, shelter and secu- 
rity. And when the pandemic ends, addressing 
gender inequality must be at the heart of the 
broader programme to ‘build back better’. 
Sustainable Development
Goals: pandemic reset 


Robin Naidoo & Brendan Fisher 


COVID-19 is exposing the 
fragility of the goals adopted 
by the United Nations — two- 
thirds are now unlikely to
be met.


As COVID-19 batters the world and its
economy, it’s time to rethink sustainable
pathways for our planet. Rosy hopes that
globalization and economic growth would bankroll
waves of green investment and development 
are no longer realistic. It’s unlikely there will 
be enough money or attention to banish pov- 
erty and inequality, expand health care and 
overturn biodiversity loss and climate change, 
all by 2030. 

The SARS-CoV-2 virus has already killed 
more than 512,000 people, disrupted the live- 
lihoods of billions and cost trillions of dollars. 
A global depression looms. The United States 
and other nations are gripped by protests 
against structural inequality and racism. And 
geopolitical tensions between superpowers 
and nuclear states are at levels not seen for 
decades. 

Things were different back in 2015, when 
the United Nations adopted 17 Sustainable 
Development Goals (SDGs) to improve peo- 
ple’s lives and the natural world by 2030. It was 
arguably one of humanity’s finest moments 
— the whole planet signed up. Many national 
budgets were flush with funds. Governments 
agreed ambitious treaties, including the Paris 
climate agreement, the Sendai framework on 
disaster risk reduction and the Addis Ababa 
plan for financing development. 

Five years on, as the UN celebrates its 75th 
anniversary, that mood of optimism has gone. 
In other words, the very foundations on which 
the SDGs were built have shifted. 

The success of the SDGs depends on two 
big assumptions: sustained economic growth 
and globalization. COVID-19 has torn these 
to shreds. The global economy is expected to 
contract by at least 5% this year, and the time- 
frame for its recovery is years, not months, 
if the past is any guide. Industrialized coun- 
tries struggling to support their own citizens 
will not bankroll the development of others. 
Overseas development aid could drop by
US$25 billion in 2021. The United States has 
announced its withdrawal from the World 
Health Organization. Increasing the scale of 
human activity on the planet looks foolish 
when it could open wells of new diseases once 
hidden in the wild, similar to COVID-19. 

Governments have basic worries. Food 
security is under threat, because farm work- 
ers are unable to travel to harvest crops; prices 
of rice, maize (corn) and wheat are rising. The 
UN World Food Programme has just doubled 
its estimate of the number of people who are 
likely to face acute food shortages this year, 
to 265 million. Demand for cash crops, such as 
Kenya’s flower exports, has stalled. Ecotourism 
has collapsed. Even oil-rich developing coun- 
tries such as Nigeria, Africa’s most populous 
nation, cannot sell their resources profitably 
in the global slowdown. 

And the world will face further stressors in 
the next decade. More pandemics, yes, but also 
extinctions and the continued degradation 
of the ecosystems on which all life depends. 
Storms, wildfires, droughts and floods will 
become more frequent owing to climate 
change. Geopolitical unrest might follow. 
Mounting costs to address these will divert 
yet more funding from existing SDG targets. 
Last year alone, the United States experienced 
14 separate billion-dollar disasters related to 
climate change. 

COVID-19 is demonstrating that the SDGs 
as currently conceived are not resilient to 
such global stressors. As the UN’s High-level 
Political Forum on Sustainable Development 
meets (virtually) this week, delegates must 
chart a new course for the SDGs. As the world
recovers from this pandemic, the forum must 
establish a few clear priorities, not a forest of 
targets. It should also consider which goals can 
be achieved in a less-connected world with a
sluggish global economy. 


Slow or worse 


Progress across the SDGs was slow even before 
COVID-19. Now, it’s even more likely that many 
of the 169 targets will not be met by 2030. 
Worse, some could even be counterproduc- 
tive (see ‘COVID-19 impacts on Sustainable 
Development Goals’). Two-thirds of the 169 
targets are either under threat as a result of 
this pandemic or not well-placed to mitigate 
its impacts (see Supplementary information). 
Migrant labourers in Uttar Pradesh protest against the lack of food in a slum area after the government eased lockdown measures.


Some might even amplify problems. Ten per 
cent of the SDG targets could worsen the 
impacts of future pandemics. 

The goal of good health is the most obvious 
casualty. Clinics everywhere are stretched by 
COVID-19, and resources are in short supply. 
The impacts are spreading to all areas of health 
care. For example, the UN children’s charity 
UNICEF warns that up to 116 million newborns 
and mothers will experience inadequate 
services in the coming months. 

With travel off the table, tourism is suffering.
The Organisation for Economic Co-operation 
and Development (OECD) estimates that the 
number of international tourists will drop by 
60% this year, reducing tourism’s contribution 
to global gross domestic product (GDP) and 
affecting countries where it is a substantial 
part of the national economy. These include 
Namibia, where tourism centred around charismatic
species such as lions, leopards and elephants
contributes 10% of the country’s GDP.
Flowing more than $11 million annually into 
local communities has also helped to dramat- 
ically increase wildlife populations, including 
those of elephants that are declining elsewhere 
in Africa. But without tourism, poaching has 
reappeared: in April, two endangered black
rhinoceroses (Diceros bicornis) were killed on
community lands for the first time in almost 
three years. 

Goals and targets that rely on a growing 
global economy will not be met. For exam- 
ple, making energy affordable and clean 
will require the creation of new markets and 
financing. Boosting industry, innovation and 
infrastructure will require extra investment. 
Even before COVID-19, financing for the SDGs 
was $2.5 trillion short. 


Concerns over targets that conflict with 
one another have been raised before’; now, 
they are pressing. For example, improving the 
transport network in developing countries is 
a key focus of the goal for industry, innova- 
tion and infrastructure. Yet, extending roads 
into wilderness puts more people in the path 
of new pathogens. Construction is damaging 
forests and other fragile environments across 
the tropics’, counter to the goal to protect life
on land. A proposed highway through Africa’s
Serengeti region would cause irreparable harm 
to a protected area that generates more than 
$80 million per year from tourism’. Similarly, 
marine traffic pollutes the air and ocean, and 
puts the goal to protect ‘life under water’ under 
threat. Air travel moves people, money and 
ideas, but helped to spread SARS-CoV-2 rapidly 
around the world*. 


What to do?


Prioritize win-wins. Cash-strapped govern- 
ments need to focus on a few broad strategic 
goals°. This will inevitably upset groups that 
support goals and targets that are de-empha- 
sized. But sometimes that cannot be avoided. 

Priorities will be hard to identify among so 
many diverse targets. Some have clear and 
quantitative aims — for example, to eradicate 
extreme poverty for all people everywhere by 
2030, such that no one is living on less than 
$1.25 a day. Others are more diffuse, such as
promoting public procurement practices that 
are sustainable, in accordance with national 
policies and priorities. 

Fortunately, some of the goals underpin 
or interact positively with many others’. 
Illegal trade in endangered species, such as pangolins, increases the risk that people are exposed to viruses from the wild.


And some become more urgent in light of 
COVID-19. For example, 30 (18%) of the targets 
would help to lessen the likelihood of another 
global pandemic. Reducing wildlife trafficking 
and the supply and demand of illegal wildlife 
products, for instance, would reduce the 
probability that new viruses would transfer 
to humans’. Three further targets — achiev- 
ing universal health coverage, bolstering 
the health workforce and strengthening the 
capacity of early-warning systems for global 
health risks — will slow the cascading impacts 
of COVID-19 in low-income nations. 

Experts in decision science, cost-benefit 
analysis and socio-ecological systems should 
complement political representatives in 
determining which SDG targets should be 
prioritized. The Fifth Assessment Report of the 
Intergovernmental Panel on Climate Change 
(IPCC) shows how information from various 
scientific disciplines can support difficult 
decisions on climate-change mitigation and 
adaptation options®. 


Decouple development and growth. 
COVID-19 is a stress test of our globalized 
economy and of our global goals for a more 
sustainable planet. Just as when global banking 
faced, and failed, a similar test in 2008, it must
be a learning experience. The financing, oversight
and implementation of the SDGs have to
be reformed. 

Sustained per-capita economic growth 
for all countries is itself an SDG target,
which points to just how deeply its pursuit is 
ingrained in the modern world. But the most 
common measure used, GDP, is distorting — it 
assigns value to undesirable factors, such as 
dangerous jobs, traffic jams and pollution’. 
And growth cannot continue forever on a finite
planet that’s already over-exploited”®. 

As a result, many have long argued that 
economies should focus on development 
(improving well-being) rather than on 
growth (increasing economic throughput)”. 
Many investors now recognize that maximizing
short-term growth cannot come at
the expense of clean air and water, a stable 
climate, peaceful communities and resilient 
ecosystems. Measures that bake explicit social 
and environmental goals into financial instru- 
ments, such as green bonds, sustainability 
bonds and impact investing, are growing in 
popularity. And this year, BlackRock of New 
York City, the world’s leading asset-manage- 
ment firm, joined the Climate Action 100+ 
investor initiative to push the world’s largest 
corporate greenhouse-gas emitters to take 
action on reducing emissions. 
Other key tenets of globalization, such as
the value of interconnectedness for efficien- 
cy’s sake, must be questioned, too. Yes, trade, 
travel and telecommunications enhance the 
well-being of billions of people and have mar- 
shalled flows of protective equipment and 
technical expertise to fight the COVID-19 pan- 
demic. But interconnections also increase the 
likelihood that future global pandemics will 
emerge and spread, and of financial contagion 
and the erosion of social protections for workers.
Full social and environmental costs, as well
as benefits, now need to be re-examined. 

Slowing another form of growth — popu- 
lation — should also be a priority. Historical 
missteps mean that such discussions can be 
contentious. But if the world’s population 
rises, as predicted, to 9.7 billion by 2050, it 
will exacerbate all other threats to sustain- 
ability. The SDG of empowering women and 
educating girls is thus crucial”. Even more 
important will be stabilizing population size 
in high-income countries, where consumption
and environmental impacts are much higher 
than in low-income nations, and where up to 
40% of pregnancies are unintended. There is 
therefore both a need and an opportunity to 
reinvigorate research and action on the scale 
of human activity on our finite planet. 


Overhaul funding. If the world’s economic 
pie cannot increase, it must be sliced in different
ways. A short-term solution is for OECD
economies to reduce perverse underwriting
of enterprises that are anathema to the SDGs. 
For example, government subsidies to the fos- 
sil-fuel industry reached $4.7 trillion (6.3% of 
global GDP) in 2015. Continued reliance on 
fossil fuels limits success in numerous SDGs, 
from sustainable energy and cities to climate 
change and biodiversity conservation. 
Another way to re-divide the pie is to rein in
corporate profits. In 2019, Fortune 500 compa- 
nies collectively posted profits of $1.1 trillion. 
This figure is 200 times the annual budget of 
the World Health Organization. Although 
many companies will face losses and bank- 
ruptcy as a result of the pandemic, there are 
ways to recoup funds to support the SDGs. 
Curbing tax avoidance is one — each year, 
low-income countries lose 1.3% of their GDP 
through tactics to avoid corporate taxes. 


Companies can also have explicit aims to serve 
the public good and the SDGs. For example, 
‘Certified B Corporations’ balance profit with 
purpose. Their numbers are increasing and 
include, for example, food giant Danone North 
America and global lifestyle brand Patagonia. 
Bancolombia, South America’s third-largest 
bank, and consumer goods firm Unilever 
invest in companies that deliver social and 
environmental profits. 

The mind-boggling sums invested in military 
defence are also at odds with the global-scale 
cooperation to which nations committed 
under the SDGs. Climate change is increasingly 
recognized as a national-security threat, and 
in the wake of COVID-19, calls to treat future 
pandemics as such will also increase. Diverting 
funds from armaments to address such security
threats would provide a funding pathway
to the SDGs that should be encouraged. Protests
in the United States are now catalysing
calls to downscale and demilitarize municipal
law-enforcement agencies, re-routing funds to
initiatives on mental health and other social
services. A similar process at the global scale
would unlock immense resources.


COVID-19 IMPACTS ON SUSTAINABLE DEVELOPMENT GOALS

SDG | Status | Example of target(s) affected
Goal 1: No poverty | Threatened* and mitigates† | Target 1.2: halve proportion of people living in poverty by 2030; Target 1.4: provide equal access to basic services
Goal 2: Zero hunger | Threatened | Target 2.3: double agricultural productivity and incomes of small-scale food producers
Goal 3: Good health and well-being | Threatened and mitigates | Target 3.8: achieve universal health coverage
Goal 4: Quality education | Threatened | Target 4.1: provide free, equitable and quality education for all children
Goal 5: Gender equality | Partially threatened‡ | Target 5.4: value unpaid care and domestic work by providing public services and policies
Goal 6: Clean water and sanitation | Threatened | Target 6.1: give access to safe and affordable drinking water for all
Goal 7: Affordable and clean energy | Threatened | Target 7.3: double global rate of improvement in energy efficiency
Goal 8: Decent work and economic growth | Threatened | Target 8.1: sustain per capita economic growth
Goal 9: Industry, innovation and infrastructure | Threatened and aggravates§ | Target 9.4: upgrade infrastructure and retrofit industries to make them sustainable
Goal 10: Reduced inequalities | Threatened | Target 10.1: sustain above-average income growth of the bottom 40% of the population
Goal 11: Sustainable cities and communities | Threatened | Target 11.2: give access to safe, affordable and sustainable transport systems for all
Goal 12: Responsible consumption and production | Partially threatened | Target 12.5: reduce waste generation through prevention, reduction, recycling and reuse
Goal 13: Climate action | Threatened | Target 13.A: mobilize US$100 billion annually by 2020 for the Green Climate Fund to address the needs of developing countries
Goal 14: Life below water | Partially threatened | Target 14.1: by 2025, prevent marine pollution of all kinds
Goal 15: Life on land | Threatened and mitigates | Target 15.7: end poaching and trafficking of protected species and address demand and supply of illegal wildlife products
Goal 16: Peace, justice and strong institutions | Partially threatened | Target 16.1: reduce all forms of violence and related deaths everywhere
Goal 17: Partnerships for the goals | Partially threatened | Target 17.2: developed countries should commit at least 0.7% of gross national income in overseas aid for developing and 0.15% to least-developed nations

*Most targets unachievable. †Achieving some targets would have helped prevent pandemic impacts. ‡Some targets affected.
§Achieving target would have made pandemic impacts worse.


Road ahead 


UN conventions have been another casualty 
of COVID-19. The 2020 UN Biodiversity Con- 
ference and the 26th United Nations Climate 
Change conference (COP26) have been post- 
poned until next year. Although global crises 
cannot afford delays, these postponements 
provide an opportunity for the lessons learnt 
from COVID-19 to be codified into the agree- 
ments that will define life on Earth into the 
next century. 

We therefore urge the UN’s High-level 
Political Forum to work out how and when to 
update the SDGs. Every goal and target should 
be screened according to three points: is this 
a priority, post-COVID-19; is it about devel- 
opment not growth; and is the pathway to it 
resilient to global disruptions? 

It is our hope that 75 years from now, the 
tragedy of 2020 will also be remembered as a 
positive watershed, after which we built back 
better. 
Readers respond


Correspondence 


US visa changes leave 
postdocs in limbo


Six of us are postdocs from 
abroad who work in the United 
States. We are therefore deeply 
concerned about the latest 
uncertainties over US visas, 
announced last month (see 
Nature http://doi.org/d2h4; 
2020). Coming on top of the
havoc wreaked by the COVID-19 
pandemic (see, for example, 
Nature 582, 449-450; 2020), the 
changes mean that researchers’ 
careers are now under grave 
threat. 

Things were already difficult 
under the preceding visa 
regime. Those of us in this 
position were wary of visiting 
our home countries, in case our 
return to the United States was 
blocked. The US government’s 
imposition of travel restrictions 
has made matters worse, 
thwarting even urgent trips back 
to our families. 

Renewing a non-immigrant 
visa is next to impossible at the 
moment, and legal advice is hard 
to obtain because of the fluidity 
and confusion of the situation. 
Applying for a new visa is not 
an option, so long as embassies 
and consulates are closed 
and travel restrictions apply. 
Meanwhile, our projects and job 
applications are on hold. 

We value the opportunity to 
work in world-class laboratories. 
In return, we contribute a highly 
motivated and affordable talent 
pool to our host institutions. 
We urge the United States 
to safeguard international 
postdocs to reinforce these 
mutual benefits. 


Amir H. Behbahani*, California 
Institute of Technology, Pasadena, 
California, USA. 
amirhb@caltech.edu 

*On behalf of 7 correspondents, 
see go.nature.com/3f2pnyl 


Landmark 100 years 
of climate modelling 


This year marks the centenary 
of the seminal work on climate 
modelling by the Serbian 
mathematician and geophysicist 
Milutin Milanković.

In 1920, Milanković published
his book Mathematical Theory 
of Thermal Phenomena Caused 
by Solar Radiation. This linked 
long-term climatic changes 
to astronomical factors that 
affect the amount of energy 
Earth’s surface receives from 
the Sun. In collaboration with 
geophysicist Alfred Wegener 
and meteorologist Wladimir 
Köppen, Milanković used this
model to determine Earth’s past 
climatic cycles over thousands 
of years, culminating in the 
monumental and influential 
1941 work Canon of Insolation 
and the Ice-Age Problem. This 
led to the recognition of regular 
changes in key astronomical 
parameters: the eccentricity of 
Earth’s orbit around the Sun, 
and the obliquity and precession 
of Earth’s rotational axis. 

As a result, science could at
last explain the distribution 
of ancient large-scale glacial 
deposits in areas that now 
have a temperate or warm 
climate. These periodic 
climatic oscillations, known 
as Milanković cycles, also
explained the repetitive 
glaciations that occurred during 
Earth’s history. Crucially, these 
cycles enabled prediction of 
future climate changes. 


Marco Romano Sapienza 
Università di Roma, Rome, Italy.
marco.romano@uniroma1.it


Bruce Rubidge University of the 
Witwatersrand, Johannesburg, 
South Africa. 
Deconstruct racism
in medicine 


COVID-19 is four times more 
likely to severely affect African 
Americans than their white 
counterparts (see go.nature.com/37fffny).
Structural racism
in our society undoubtedly 
contributes to this stark 
difference. As physician- 
scientists, we have a duty to
break this cycle of disadvantage 
through our clinical work, 
scientific inquiry and education 
efforts. 

Health inequity is 
perpetuated by social, economic 
and environmental disparities 
in the African American
community. Researchers 
focusing on patient-oriented 
studies should ensure that 
cohorts are representative of 
racial demographics. Too few 
people of colour currently 
enrol in clinical trials. This 
stems, in part, from mistrust, 
after the US Public Health 
Service’s scandalous 1932-72 
Tuskegee study of untreated 
syphilis in Black males (see 
S. M. Reverby Nature 567, 462; 
2019). Clinical researchers now 
have an obligation to patients 
and their families to advocate 
and educate on the risks and 
benefits of participation in 
clinical trials. 

As educators, we must also 
remove the unconscious bias 
that affects student selection 
and commit to mentoring 
students of colour. This will 
expand the pipeline of under- 
represented scientists and 
better equip us to tackle racial 
disparities in a clinical setting. 


Talia H. Swartz Icahn School of 
Medicine at Mount Sinai, New 
York, New York, USA. 
talia.swartz@mssm.edu 


Boghuma Titanji Emory University 
School of Medicine, Atlanta, 
Georgia, USA. 
Dreadlocks and
discrimination 


As an Afro-Colombian soil 
ecologist with dreadlocks, I 
have encountered prejudice and 
scepticism about my profession 
countless times — from airport 
and immigration authorities, the 
public at outreach events and 
even colleagues at conferences. 
Such experiences reinforce 
my conviction that, as scientists, 
it is our professional, civic 
and moral duty to strongly 
denounce and combat systemic 
and structural racism and its 
intersectionality with other 
forms of oppression and 
discrimination. We must study 
its causes (see E. Culotta Science 
336, 825-827; 2012), as well as
its tragic consequences (see, for 
example, A. Mesic et al. J. Natl 
Med. Assoc. 110, 106-116; 2018). 
There is also plenty of 
work to do to eliminate racial 
under-representation in 
science. For example, a study 
published earlier this year 
shows that PhD students from 
under-represented groups 
in the United States innovate 
at higher rates than do those 
in the majority, but that their
contributions are less likely 
to lead to academic positions 
(B. Hofstra et al. Proc. Natl Acad. 
Sci. USA 117, 9284-9291; 2020). 


César Marin University of 
O'Higgins, San Fernando, Chile. 
cesar.marin@uoh.cl 
News & views


Coronaviruses 


Going back in time for an 
antibody to fight COVID-19 


Gary R. Whittaker & Susan Daniel 


Efforts are intensifying to try to harness antibodies as a 
therapy for COVID-19. A study reveals the insights that can
be gained from antibodies made by a person who had a
coronavirus infection that caused the disease SARS. See p.290 


The COVID-19 pandemic is the biggest
public-health crisis in a century, and the development
of medical interventions to combat
the SARS-CoV-2 coronavirus is a top priority. 
On page 290, Pinto et al.' provide evidence 
needed to take one of the crucial first steps 
for such efforts in the developing arena of 
antibody immunotherapy. 

The level of protection provided by the 
immune system in response to SARS-CoV-2 
exposure and infection is a hotly debated 
topic’. It is thought that one major arm of 
the immune response to such infection is the 
development of antibodies that recognize 
the virus. Of particular interest are antibodies 
that bind to a protein on the viral surface 
known as the spike protein. Coronaviruses 
derive their name from their distinctive, 
crown-like (coronal) viral silhouettes, which 
are due to these proteins. 

Antibodies that recognize and bind to 
the viral ‘spike’ can block its ability to bind 
the ACE2 receptor protein on human cells. 
An interaction between the spike protein 
and ACE2 is part of a process that can enable 
coronaviruses to enter human cells. Thus, 
antibodies that could hinder spike-protein 
function would block infection; such 
antibodies are termed neutralizing antibodies. 

Much remains to be learnt about the 
immunological responses to SARS-CoV-2. 
Nevertheless, it is becoming clear that anti- 
bodies taken from the blood serum of people 
who have recovered from COVID-19 can 
be used for treatment by being transfused 
into other people who have the disease’. 
Such ‘convalescent sera’ approaches are 
highly attractive, particularly as an imme- 
diate treatment option. That’s because 
more-conventional therapeutics, such as 
drugs or vaccines, are unlikely to be available
for some time. A more high-tech approach to
using convalescent sera is the manipulation 
of antibody-producing B cells taken from the 
blood of people who had COVID-19 or other 
coronavirus infections. Each B cell makes one 
unique antibody, and clonal populations of a 
B cell of interest can be used to generate an 
identical pool of a particular desired antibody 
known as a monoclonal antibody.

To accelerate the process of therapeutic 
development, Pinto and colleagues ‘went 
back in time’, and turned to samples of B cells 
collected from a person who had been infected
by the coronavirus SARS-CoV. This virus, which 
is similar to SARS-CoV-2, caused an outbreak in
2003 of a disease called severe acute respira- 
tory syndrome (SARS). The hope with such an 
approach is that the resemblance between the 
two viruses might mean that some antibodies 
that recognize SARS-CoV also recognize and 
neutralize SARS-CoV-2. 

The ‘head’, or receptor-binding domain 
(termed S1), of the spike protein is the most 
accessible region of the protein for antibodies 
to bind to. However, this domain exists in dif- 
ferent dynamic states, and debate has arisen 
over whether it is ‘masked’ from the immune 
system by a shell of carbohydrate molecules’. 
The identification of a functional antibody that
targets this region is therefore not a trivial process.
Pinto et al. combined blood cells taken in
2004 and 2013 from a person who had recov- 
ered from SARS, and searched for antibodies 
that could recognize SARS-CoV-2 (Fig. 1). 
Of the 25 different monoclonal antibodies 
that the authors studied, 4 recognized the 
receptor-binding domains of both SARS-CoV 
and SARS-CoV-2 spike proteins. One antibody, 
termed S309, was selected for further study 
on the basis of its high-affinity binding to this 
domain when tested in vitro. 

Figure 1 | An antibody that blocks coronavirus infections. Pinto et al.’ have identified a human antibody
that blocks infection by SARS-CoV-2, the coronavirus that causes COVID-19. The authors made this discovery
by examining antibodies made by a person who had recovered in 2003 from infection with the related
coronavirus SARS-CoV, which causes severe acute respiratory syndrome (SARS). a, Coronaviruses such
as SARS-CoV infect human cells by binding to the protein ACE2. b, Pinto and colleagues analysed blood
samples taken in 2004 and 2013 from a person who recovered from SARS, and examined antibodies made
by the immune cells from the samples. They identified an antibody (named S309) that bound to the spike
protein of SARS-CoV and prevented infection by this virus. c, The authors found that this antibody bound to
a similar region of the spike protein of SARS-CoV-2 and prevented infection by the virus.

Pinto and colleagues used cryo-electron
microscopy to visualize the interaction
between the S309 antibody and the
SARS-CoV-2 spike protein. This revealed
that S309 binds to an accessible site in the
receptor-binding domain of the spike protein
that has an attached carbohydrate molecule. 
This region is not part of the key area that 
directly binds to ACE2. The site that S309 rec- 
ognizes is evolutionarily conserved in spike 
proteins across a range of bat coronaviruses 
(in the genus Betacoronavirus lineage B; sub- 
genus Sarbecovirus) that have similarities to 
the SARS-like coronaviruses. This raises the 
possibility that such an antibody could have 
wide applicability in tackling related viruses. 
Not only, then, is this antibody of interest when 
investigating ways to manage the COVID-19 
pandemic in the years ahead, but it might also 
be considered for use in preventing future out- 
breaks of related animal viruses, if they make 
the leap to causing infection in humans. 

Ultimately, it seems unlikely that a robust 
treatment for COVID-19 will rely on a single 
antibody. Rather, as was the case for SARS, 
a synergistic approach combining different 
monoclonal antibodies in an antibody cocktail 
might be more effective’. For such approaches 
to move forwards, evidence of effective anti- 
body neutralization from in vitro studies will 
be needed, along with in vivo data assessing 
how well an antibody can boost other aspects 
of the immune response — by enlisting other 
immune cells to tackle the infection, for exam- 
ple. There are many promising avenues to 
explore in these efforts. 

Pinto and colleagues got a head start with 
their work by exploring pre-existing anti- 
bodies, and they should now have more B-cell 
populations to mine. Many other teams, to 
give just some examples”* ”, have also pre- 
sented useful discoveries in the hunt for 
antibodies that can target SARS-CoV-2. The 
next steps will be to test individual antibod- 
ies and antibody cocktails in animal models, 
to determine whether they offer protection, 
and then to assess their safety and effective- 
ness in human clinical trials. An accelerated 
path might narrow the time lag between anti- 
body discovery and proof-of-concept trials 
in humans to as little as five or six months".

The most recent prominent example 
of immunotherapy for infectious disease 
relates to battling the Ebola virus. In con- 
cert with vaccines and conventional, 
small-molecule-drug trials, the development 
of monoclonal-antibody therapies for Ebola 
has progressed rapidly. Cocktails of anti- 
bodies, beginning with one called ZMapp, 
that target a key Ebola viral protein called GP 
in two crucial regions of the protein, are con- 
tinuing to be developed». This progress in 
efforts to tackle Ebola gives hope for similar 
immunotherapy achievements in targeting 
SARS-CoV-2. Pinto and colleagues’ work marks 
a major step towards that much-anticipated, 
and much-needed, success. 


Gary R. Whittaker is in the Department of 
Microbiology and Immunology, and in the
Biomolecular Engineering, Cornell University,
Ithaca, New York 14853, USA.

Atmospheric CO₂ removed
by rock weathering


Johannes Lehmann & Angela Possinger 


Large-scale removal of carbon dioxide from the atmosphere 
might be achieved through enhanced rock weathering. It now 
seems that this approach is as promising as other strategies, 
in terms of cost and CO₂-removal potential. See p.242


Achieving targets for mitigating global 
warming will require the large-scale withdrawal 
of carbon dioxide from the atmosphere. On 
page 242, Beerling et al. report that enhanced 
rock weathering in soils has substantial technical
and economic potential as a global strategy
for removing atmospheric CO₂. When crushed
basalt or other silicate material is added to
soil, it slowly dissolves and reacts with CO₂
to form carbonates. These either remain
in the soil or move towards the oceans. The
authors argue that this method would enable
between 0.5 billion and 2 billion tonnes of CO₂
to be removed from the atmosphere each year.
This rate is similar to that of other land-based
approaches, such as the accrual of organic
carbon in soil, carbon capture and sequestration
in geological formations, and the addition
of biochar (a carbon-rich material) to soil.
Beerling and colleagues find that removing
atmospheric CO₂ through enhanced
rock weathering would cost, on average,
US$160-190 per tonne of CO₂ in the United
States, Canada and Europe, and $55-120 per
tonne of CO₂ in China, India, Mexico, Indonesia
and Brazil. Furthermore, the authors report
that China, the United States and India (the
three largest emitters of CO₂ from fossil-fuel
use) have the highest potential for CO₂
removal using this method. However, they also
note that the application of silicate material to
soil (Fig. 1) requires careful assessment of the
risks, such as the possible release of metals and
persistent organic compounds (compounds
resistant to environmental degradation).
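
To get a feel for the scale these figures imply, the quoted removal range can be multiplied by the quoted per-tonne costs. The sketch below does only that arithmetic. Pricing the entire removal range at a single region's cost is a simplifying assumption made here for illustration, not something the authors report, and the function name is hypothetical.

```python
# Figures quoted in the text; the combination of them is illustrative only.
REMOVAL_GT_PER_YEAR = (0.5, 2.0)  # billion tonnes of CO2 removed per year
COST_USD_PER_TONNE = {
    "United States, Canada and Europe": (160, 190),
    "China, India, Mexico, Indonesia and Brazil": (55, 120),
}


def annual_cost_range_busd(removal_gt, cost_per_tonne):
    """Return (low, high) annual cost in billions of US dollars.

    Billion tonnes multiplied by dollars per tonne gives billions of dollars.
    Assumes (hypothetically) that all removal is priced at one region's cost.
    """
    low = removal_gt[0] * cost_per_tonne[0]
    high = removal_gt[1] * cost_per_tonne[1]
    return low, high


for region, cost in COST_USD_PER_TONNE.items():
    low, high = annual_cost_range_busd(REMOVAL_GT_PER_YEAR, cost)
    print(f"{region}: ${low:.1f}-{high:.1f} billion per year")
```

The wide spread between the low and high ends reflects both the uncertainty in removal potential and the regional cost differences described above.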

Despite the enthusiasm the authors’ find- 
ings might generate, it is crucial to stress that, 
even under optimistic assumptions, enhanced 
rock weathering will sequester only some of 
the annual global carbon emissions from 
fossil-fuel use. Therefore, reducing these 
emissions should still be the top priority for 
averting dangerous climate change. But, as 
Beerling et al. note, any approach is insuffi- 
cient alone, and should be considered as part 
of a portfolio of options. 

Several other land-based carbon-seques- 
tration techniques rely on soils. However, 
inorganic-carbon sequestration by rock 
weathering is fundamentally different from 
organic-carbon sequestration. The latter relies 
on photosynthesis by plants to remove CO₂
from the atmosphere, and on soils to retain the
plant carbon, mostly in the form of microbial 
remains. In the future, therefore, scientists 
should pay closer attention to what they mean 
by ‘carbon sequestration’ — is it inorganic or 
organic? 

The sequestration of atmospheric CO₂
through enhanced rock weathering shares 
some of the principal appeal, but also the 
challenges, of organic-carbon sequestra- 
tion. The fact that crop production benefits 
is certainly a key asset of both methods. In the
case of enhanced rock weathering, the added
rock contains essential plant nutrients, such 
as calcium and magnesium, as well as potas- 
sium and micronutrients that promote crop 
production in several ways. We would go even 
further than the authors do, to claim that these 
nutrients are currently insufficiently supplied 
in agriculture. 

Increasing soil pH alone would substantially 
boost crop yields in many regions of the world, 
because it is possible that low pH constrains 
crop production on more than 200 mil- 
lion hectares of arable and orchard soils. This 
area is equivalent to about 20% of the total 
extent of these soils (967 million hectares; 
see go.nature.com/3Ircajd). Consequently, 
on a global scale, acidity is the most important
soil constraint for agriculture. However, there
have been no detailed multi-regional analyses 
of the difference in crop yield between low-pH 
and optimum-pH soils, and such investigations 
would benefit the study of synergies between 
carbon-sequestration methods. The proposed 
rock additions could conceivably mitigate the 
low use and supply shortages of agricultural 
limestone in several regions. Furthermore,
calcium improves root growth in acidic subsurface
soil, with crucial knock-on effects
through greater water uptake by plant roots. 

Co-deployment of enhanced rock 
weathering with other soil-based seques- 
tration approaches might both reduce 
limitations and maximize synergies. Beerling
and colleagues’ study hints at some of these 
opportunities and at constraints that have 
procedural and soil-biogeochemical aspects. 
Greater crop growth will increase the input of 
crop residue (the materials from crops that are 
left in a field after harvesting) to the soil, and 
thereby enhance the accrual of organic car- 
bon. However, the possibility that interactions 
between calcium and organic matter impede 
the return of CO₂ to the atmosphere has been
sparsely explored, and there is little informa- 
tion on the effects of magnesium. In princi- 
ple, calcium can reduce the decomposition 
of organic matter by facilitating adsorption to 
clay, inclusion in carbonates or aggregation.
But the indirect effects of calcium through 
changes in microbial ecology or interactions 
with organic compounds, rather than inter- 
actions only between organic compounds and 
clay minerals, are rarely studied. 

If the synergy becomes a trade-off between
organic-carbon sequestration and crop pro- 
duction, the organic-carbon content of soil 
could decrease, threatening the livelihoods of 
farmers, and even food security. Any carbon 
sequestration involving soils is a formidable 
challenge to incentivize, predict and monitor,
because the sequestration technologies must 
be used on vast areas of land that are operated 
by hundreds of millions of farmers. Inevita- 
bly, there will be individual cases in which 
positive-yield projections are not met or crop
yields even decrease, where incentives fail to
persuade farmers, or where supply chains
break down. But scientists should not be
deterred from evaluating such technologies,
and should instead accept that farmers need to
be in the driving seat in adapting soil management
to meet their specific site and crop-production
goals. A concerted global effort will be
required to develop site-specific optimization
through farmer-centred research.

Figure 1 | Application of silicate material to cropland. Beerling et al. demonstrate that enhanced rock weathering, achieved by adding crushed basalt or other silicate material to soil, is an effective strategy for removing carbon dioxide from the atmosphere.

Fertilizer distribution networks are 
common in many parts of the world. But even
where these networks are in place, success in 
the adoption of enhanced rock weathering 
might not rely on its crop-production bene- 
fits alone. We posit that carbon markets are 
required, and that it would be helpful if they 
incentivized socially and environmentally 
sound implementation. For technologies to
be eligible, it must be shown that they provide 
extra incentives for adoption (additionality), 
beyond what increased soil fertility would 
deliver. We emphasize that implementation 
of enhanced rock weathering and other soil- 
based carbon sequestration must consider 
equitable and financially sound incentives for 
farmers that overcome challenges of additionality,
among others, in a proactive way.

Consequently, the main lesson here might
be that several of the major potential technologies
for removing atmospheric CO₂
could generate substantial benefits for food
production, and are centred around managing
soils. Farmers must be fully behind such a 
global effort or it will fail. Scientists might 
need to recognize that climate-change miti- 
gation is not a sufficient incentive on its own, 
and that benefits to crop growth will need to be 
prioritized, as will financial incentives. Such an 
approach of financially supporting soil health 
and crop production could emerge as our best 
near-term solution to the problem of removing 
CO₂ from the atmosphere.

Immunology


An antiviral response
beyond immune cells


Tomas Gomes & Sarah A. Teichmann 


Fibroblast, epithelial and endothelial cells are more than just 
the scaffold of an organ — it emerges that they communicate 
with immune cells and are primed to launch organ-specific 
gene-expression programs for antiviral defence. 


Immune-system responses to disease-causing 
agents rely on a complex web of interactions 
between immune cells that are underpinned 
by robust regulatory mechanisms. Most of 
our understanding of the immune system 
revolves around these cells, yet cells gener- 
ally thought of as having a mainly structural 
role can also respond to invading organisms. 
Writing in Nature, Krausgruber et al. report a
multi-organ examination of gene-expression 
programs for such structural cells in mice, 
revealing the roles of these cells in signalling 
networks used for defence purposes. The 
authors found that the response of structural 
cells to external invaders is regulated and 
tailored to the particular organ in question. 

Structural cells, such as fibroblasts and 
endothelial and epithelial cells (Fig. 1), are 
present in most organs and provide more 
than just support. Fibroblasts form part of
the connective tissue and help to maintain the 
extracellular matrix material that surrounds 
cells. Endothelial cells line the interior of 
vessels such as blood vessels and, along with 
epithelial cells, which are present on the sur- 
face of organs, can be involved in responses 
to infection, either directly or through 
interactions with immune cells.

To understand the role of these three types 
of cell in immune responses, Krausgruber 
and colleagues isolated them from 12 dif- 
ferent tissues in healthy mice. The authors 
used RNA sequencing to determine the 
genes expressed by the cells, and searched 
for known immune-associated genes. 
Krausgruber et al. also characterized the 
cells’ chromatin — the complex of DNA and 
protein in the nucleus — to pinpoint genomic 
regions that were poised to start gene expression.
This was done using a method called
ATAC-seq to determine genome-wide ‘open’
chromatin accessibility, and the authors
identified active promoter regions by tracking
a type of modification called H3K4me2 on
the DNA-binding histone 3 protein. Together,
these methods opened a window on the
transcriptional regulatory circuits that govern
the identity and function of these cells.

Although the three cell types can be defined
by the expression of genes corresponding to 
specific marker proteins found on the cell 
surfaces, the three cellular lineages also 
presented features that were characteristic 
of their local organ environment. Across the 
genome, the data sets for gene expression, 
open chromatin and active promoters indi- 
cated that the different cell types in an organ 
were more similar to each other than was a 
given cell type to the same cell type in differ- 
ent organs. This is a crucial observation that 
provides a foundation for future studies on 
the specific role that structural cells have in 
the function of each organ. 

Figure 1 | Structural cells are poised for organ-specific defence responses. a, Krausgruber et al. analysed three cell types (fibroblasts, and endothelial and epithelial cells) that are usually considered to have a structural role in organs. They found that, in mice, these cells signal to and interact with cells of the immune system (such as T cells, B cells, monocytes, macrophages and NK cells) to provide organ-specific defence responses. The authors report that structural cells express genes encoding chemokine proteins (for the examples given, the chemokines were Ccl25, Ccl21a, Cxcl10, Cxcl12, Ccl2 and Ccl13) that can attract immune cells. Structural cells also express other genes encoding ligands and receptors (not shown) that might aid communication with immune cells. The molecular interaction patterns identified were usually unique to each organ. b, Krausgruber et al. used RNA sequencing to profile gene expression in structural cells, and also assessed the state of chromatin (DNA wrapped around structures called nucleosomes) in the cells. Some genes were poised for expression: they had chromatin in an open state, and the authors described these genes as having unrealized potential. After infection with lymphocytic choriomeningitis virus (LCMV), these genes were expressed in a process that was often aided by the cytokine proteins IL-6 and IFN-γ (possibly secreted by immune cells). These genes were activated in a cell-type- and organ-specific manner, and constituted a key part of the early response of structural cells to infection.

The authors searched the gene-expression
data of structural cells to see which receptors
and ligand molecules they expressed, and then
matched the cells to possible interaction partners
by mining previously published RNA-sequencing
data for immune cells. They then
assembled a computationally derived network
that unveils possible cell-type- and organ-specific
interactions involving structural and
immune cells, and defines the baseline for
routine interactions between immune cells
and structural cells, from which further
cellular crosstalk would develop on infection.
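
The matching step described above can be pictured as a simple join between ligands expressed by structural cells and receptors expressed by immune cells. The following is a minimal sketch under stated assumptions, not the authors' pipeline: the ligand-receptor pairs are well-known chemokine pairings, but the cell populations and which genes each one expresses are invented here purely for illustration, as is the function name.

```python
from itertools import product

# Well-established chemokine ligand -> receptor pairings.
KNOWN_PAIRS = {
    ("Ccl21a", "Ccr7"),
    ("Cxcl12", "Cxcr4"),
    ("Cxcl10", "Cxcr3"),
    ("Ccl2", "Ccr2"),
}

# Hypothetical expression calls (gene sets) for illustration only.
structural = {
    "spleen fibroblast": {"Ccl21a", "Cxcl12"},
    "liver endothelium": {"Cxcl10"},
}
immune = {
    "T cell": {"Ccr7", "Cxcr3"},
    "monocyte": {"Ccr2"},
}


def predict_interactions(structural_cells, immune_cells, known_pairs):
    """Return sorted (structural cell, ligand, receptor, immune cell) edges.

    An edge is predicted when a structural cell expresses a ligand and an
    immune cell expresses the matching receptor.
    """
    edges = []
    for (s_cell, s_genes), (i_cell, i_genes) in product(
        structural_cells.items(), immune_cells.items()
    ):
        for ligand, receptor in known_pairs:
            if ligand in s_genes and receptor in i_genes:
                edges.append((s_cell, ligand, receptor, i_cell))
    return sorted(edges)


edges = predict_interactions(structural, immune, KNOWN_PAIRS)
for edge in edges:
    print(" -> ".join(edge))
```

Such a network only predicts which cells could communicate; whether they actually do would need to be confirmed experimentally.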

To understand more about how structural 
cells might prepare to trigger a gene-expression 
program for defence purposes, the authors 
assessed their gene-expression data together 
with the chromatin-accessibility profiles of 
the corresponding gene promoters (DNA 
sequences that aid gene expression). An open 
chromatin region encompassing a gene’s pro- 
moter is known to be a reliable indicator of 
expression of the gene. The authors used these
combined data to look for outliers — genes 
that had an open accessible promoter but 
low levels of expression, on the assumption 
that such genes have what Krausgruber and 
colleagues describe as unrealized poten- 
tial. This indicates genes that are probably 
poised for a rapid response when infection 
occurs. The approach highlighted a group 
of genes encoding a substantial number of 
immune-associated proteins, and examples 
of these were most evident in structural cells 
from the skin, liver and spleen. These genes are
worthy of further study that focuses on how 
the structural cells that express them respond 
to infection and protect the organ that is their 
home. 
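
The outlier search described in this paragraph amounts to a filter: keep genes whose promoter is accessible but whose expression is low. A minimal sketch follows, assuming normalized accessibility scores between 0 and 1 and expression in arbitrary units. The thresholds, gene names, values and function name are all hypothetical and are not taken from the study.

```python
def unrealized_potential(genes, min_accessibility=0.7, max_expression=1.0):
    """Return sorted names of genes with an open promoter but low expression.

    genes maps a gene name to (promoter accessibility, expression level).
    Both cut-offs are illustrative assumptions, not published values.
    """
    return sorted(
        name
        for name, (accessibility, expression) in genes.items()
        if accessibility >= min_accessibility and expression <= max_expression
    )


# Hypothetical measurements for one structural-cell population.
example = {
    "gene_a": (0.9, 0.2),   # open promoter, barely expressed -> poised
    "gene_b": (0.8, 15.0),  # open promoter, already expressed
    "gene_c": (0.1, 0.1),   # closed promoter, silent
    "gene_d": (0.75, 0.9),  # open promoter, low expression -> poised
}

print(unrealized_potential(example))
```

In practice the cut-offs would have to be chosen relative to the overall relationship between accessibility and expression in each data set.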

The authors confirmed that they had indeed 
identified genes poised for a role in an immune
response by infecting mice with lymphocytic 
choriomeningitis virus (LCMV) and then mon- 
itoring gene expression by RNA sequencing of 
structural cells. LCMV is a well-studied virus 
that affects most organs, and this allowed 
Krausgruber and colleagues to distinguish 
organ-specific from global defence responses. 
Eight days after infection, up to 57.9% of the 
genes of unrealized potential had been acti- 
vated in structural cells, with notably high 
responses in fibroblasts and endothelial cells 
in the liver, spleen, lungs and large intestine.

Furthermore, the authors found that 
an antiviral response was evident in these 
gene-expression profiles. When infected 
and non-infected animals were compared,
the infected animals had higher levels of
expression of transcription factors and 
immune-associated signalling proteins called 
cytokines that are involved in pathways asso- 
ciated with expression of the antiviral protein 
interferon. In response to the viral infection, 
structural cells also expressed small pro- 
teins called chemokines that attract immune 
cells. This was a surprise, because chemo- 
kine secretion has been mainly associated 
with immune cells. The authors propose that 
their predicted interaction network between 
immune cells and structural cells is altered 
on LCMV infection, and suggest that, on 
infection, structural cells in various organs 
increase interactions with immune cells such 
as monocytes, macrophages and B cells. 

To dissect the effects of signalling in 
response to LCMV infection, the authors 
injected individual cytokines, of types 
detected in the antiviral response, into the 
bloodstream of mice that did not have an LCMV 
infection. Krausgruber et al. then sequenced 
the RNA in structural cells from the organs with 
the greatest previously observed response 
to LCMV. They found that gene-expression 
changes were more evident in fibroblasts 
and endothelial cells than in epithelial cells. 
Dissecting the gene-expression response to 
each cytokine revealed the portion of the anti- 
viral program that it controls. Among other 
interactions, this revealed that the cytokines 
IL-6 and IFN-γ, possibly produced in vivo by
immune cells, are responsible for eliciting 
much of the antiviral response of spleen 
endothelial cells by driving the expression of 
genes with unrealized potential. 

Although gene-expression programs 
involved in the immune response have been 
reported previously for some structural cells, 
Krausgruber and colleagues’ work under- 
scores these cells’ decisive role in coordinating 
organ-specific and organism-wide immune 
responses. It also indicates how functionally 
relevant candidate genes can be pinpointed 
using a combination of cell-communication 
networks and analysis of chromatin-mediated
regulation. One of the ultimate goals of this
research field could be to develop cell- 
type-targeted therapies that modulate 
immune responses. This could greatly ben- 
efit cancer research, for example, because 
cancer-associated fibroblasts have a role in 
promoting tumour progression.

Future studies will probably focus on the 
defence responses of other types and subtypes 
of human cells in studies linked to the Human 
Cell Atlas initiative, which is generating
detailed molecular profiles for all human cells 
to fully describe cell-type diversity. Single-cell 
approaches could assist in profiling the RNA 
transcripts in all cell types and states of entire 
organs, in steady-state and post-stimulus scenarios.
The use of a new method called spatial
transcriptomics (which monitors gene expres- 
sion in intact tissue sections rather than in 
dissociated cells), together with information 
about chromatin status, could disentangle 
the entire cellular chain of events, from the 
detection of infection to the defence response 
and immune-cell recruitment, and then finally 
to the removal of the infectious agent. By 
profiling structural cells in different mouse 
organs, Krausgruber et al. have unlocked a 
trove of knowledge about antiviral defences, 
which might be relevant to other species and 
facilitate new ways to target human diseases.

News & views


Cancer evolution 


Strands of evidence 


Trevor A. Graham & Sarah E. McClelland 


DNA damage can cause mutations due to failure of DNA repair 
and errors during DNA replication. Tracking the strand of the 
DNA double helix on which damage occurs has shed light on 
processes that affect tumour evolution. 


How a cancer evolves and how mutations are 
generated are highly intertwined processes, 
and both are nearly impossible to observe 
directly. Instead, we are usually restricted 
to making inferences about them using data 
from a single snapshot in time after a cancer
has formed. Writing in Nature, Aitken et al.
show that, for a cell that has undergone DNA
damage, such a snapshot provides remarkably
rich information when the two DNA strands 
that form the double helix are considered 
independently. 

DNA resembles a ladder, with the two 
‘side rails’ often called, respectively, the Watson 
and Crick strands. These are fused together 
by ‘rungs’ of two complementary nucleo- 
tide base pairs: either cytosine (C) paired 
with guanine (G) or adenine (A) paired with 
thymine (T). When a cell divides, each daughter
cell inherits either the Watson or Crick strand 
from the parent; this provides a template 
from which the other, complementary strand 
is replicated. Damage to a base can trigger a 
repair process, but if repair is not swift enough, 
the damaged base might be mispaired with an 
incorrect base during DNA replication. At the 
next round of cell division, when a daughter cell 
with such a mispaired base prepares to divide, 
the base complementary to the mispaired base 
will be added to the newly synthesized strand. 
This leads to a double-stranded mutation at 
the base pair corresponding to the original 
damaged base (Fig. 1). 
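
The two-step route from a single-strand lesion to a double-stranded mutation can be traced with a toy replication function. This is a minimal sketch under simplifying assumptions: strands are written position-aligned (strand orientation is ignored), and the example sequence, lesion position and mispaired base are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def replicate(template, mispairs=None):
    """Synthesize the strand complementary to `template`.

    mispairs maps a template position carrying an unrepaired lesion to the
    (incorrect) base inserted opposite it. All other positions pair normally.
    """
    mispairs = mispairs or {}
    return "".join(
        mispairs.get(i, COMPLEMENT[base]) for i, base in enumerate(template)
    )


parental = "GATC"  # hypothetical strand with a damaged T at position 2

# First division: C is wrongly inserted opposite the damaged T (instead of A).
first_new_strand = replicate(parental, mispairs={2: "C"})

# Second division: the mispaired strand now serves as an undamaged template,
# so its C pairs with G and the change is fixed on both strands.
second_new_strand = replicate(first_new_strand)

print("parental strand:       ", parental)
print("after first division:  ", first_new_strand)
print("after second division: ", second_new_strand)
```

Comparing the final strand with the parental one shows the T replaced by G at the lesion position, with its partner strand carrying the matching C.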

Standard practice for genome sequencing 
is to consider mutations without paying 
attention to which of the strands received 
the original damage. However, when a chem- 
ical change occurs that damages a base, 
creating a site referred to as a lesion, this 
lesion is on only one of the two DNA strands 
of the affected base pair. Aitken and col- 
leagues had the insight to see that, because 
the ‘parental’ Watson and Crick strands of
an original cell that underwent DNA damage
are separated into different daughter cells
when the cell divides, two cell lineages can be
tracked individually by following the unique
pattern of mutations that lesions on each of
the parental strands generate.

To induce DNA lesions, Aitken and colleagues 
gave mice a large dose of the carcinogenic 
molecule diethylnitrosamine. This treat- 
ment predominantly caused DNA lesions at 
T bases in liver cells, ultimately leading to 
tumour growth. When the authors examined 
the pattern of diethylnitrosamine-induced 
mutations along the genome of each tumour, 
they found long stretches of the genome 
that, compared with the original, unmutated 
genome, were highly enriched for mutations 
in which T was mutated to any other base (N). 
These mutations were derived from T lesions 
on the DNA strand (let’s call it the Crick strand)
that was inherited by the daughter cell, and its
cellular descendants, that went on to form the
tumour. Lesions on either strand can generate
mutations, but it might be the case that lesions
on only one of the parental strands generate
tumour-promoting mutations and hence
only one of the two daughter lineages forms
a tumour. In the example shown in Figure 1,
the lesions on T bases on the corresponding
Watson parental strand received by the other
daughter cell did not lead to the formation
of tumour-promoting mutations, and this
cellular lineage therefore did not contribute
to the tumour.

Figure 1 | Tracking the connection between damage to individual DNA strands and tumour mutations. Aitken et al. analysed mutation patterns in the cells of mice that received a carcinogenic molecule called diethylnitrosamine, which damages thymine (T) nucleotide bases. Damage is indicated by red circles. The cell originally exposed to diethylnitrosamine has two DNA strands, named here as the parental Crick and Watson strands. The correct pairing of bases is either adenine (A) paired with T or guanine (G) paired with cytosine (C). When the original cell divides, an incorrect base (shown in pink) can mispair with a damaged T base. Then, as those two daughter cells divide, the incorrect base will pair with the complementary matching base (for example, a mispaired C pairs with a G on the newly made strand), which results in both DNA strands having a mutation at that particular base pair. The original cell division generates two cell lineages that have mutations arising from damage to either the parental Crick or parental Watson strand, respectively. Each of these lineages has mutations at distinct base-pair locations. As the cells continue to divide, mispairing opposite unrepaired damaged T bases continues, producing different mutations at the same base-pair position and generating genetic diversity. Only cell lineages with mutations that aid tumour growth will be found in a tumour that the mouse subsequently develops. This example shows one possible scenario, in which mutations arising as a consequence of damage only to the parental Crick strand contribute to cancer growth.

The authors realized that the pattern of 
base-pair locations that had T-to-N mutations 
enabled them to pinpoint the individual 
Watson or Crick strand that had served as 
the template strand for the first cell in the 
tumour. This template strand carried diethyl- 
nitrosamine-induced lesions mainly at T bases, 
allowing the fate of each strand to be tracked 
individually through subsequent cell divisions 
(Fig. 1). These individual ‘strands of evidence’ 
provide remarkable information about the 
process of mutation and tumour evolution. 
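
The inference described here can be sketched as a counting exercise. When mutations are reported against one reference strand, lesions at T bases on that strand show up as T-to-N changes, whereas T lesions on the opposite strand show up as A-to-N changes. The function below is a minimal illustration of that idea, not the authors' method. The names 'forward' and 'reverse', the ratio threshold and the example mutation lists are assumptions made for this sketch.

```python
def lesion_strand(mutations, min_ratio=2.0):
    """Guess which strand carried the T lesions in a genomic segment.

    mutations is an iterable of (reference base, mutated base) pairs, all
    reported on the forward reference strand. Returns 'forward', 'reverse'
    or 'undetermined'. min_ratio is an illustrative cut-off.
    """
    mutations = list(mutations)
    t_to_n = sum(1 for ref, alt in mutations if ref == "T" and alt != "T")
    a_to_n = sum(1 for ref, alt in mutations if ref == "A" and alt != "A")
    if t_to_n >= min_ratio * max(a_to_n, 1):
        return "forward"
    if a_to_n >= min_ratio * max(t_to_n, 1):
        return "reverse"
    return "undetermined"


# Hypothetical segments of a tumour genome.
segment_1 = [("T", "C"), ("T", "A"), ("T", "G"), ("T", "C"), ("A", "G")]
segment_2 = [("A", "G"), ("A", "T"), ("A", "C"), ("T", "C")]

print(lesion_strand(segment_1))
print(lesion_strand(segment_2))
```

A segment with too few mutations, or with a balanced mix, is left undetermined rather than forced into either class.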

In gene expression, DNA is transcribed to 
produce RNA, and DNA lesions can be repaired 
by a process called transcription-coupled 
repair. Aitken and colleagues observed that
transcription-coupled repair occurred pref- 
erentially on the strand being transcribed, 
as opposed to the complementary strand, 
and that higher levels of transcription were 
associated with an increased frequency of 
repair. 

The authors found that failure to repair 
a DNA lesion over successive cell cycles, an 
interesting observation in itself, provided an 
unexpected source of genetic diversity. Each 
round of DNA replication on a lesion-containing
strand could lead to the incorporation of
a different ‘wrong’, mispaired base opposite 
the lesion site in the newly synthesized strand. 
If this happened, it caused further, distinct 
mutations at the same genomic position, gen- 
erating cells in the tumour each with different 
mutations of the same base pair. 

Observing recurrent mutations at the same 
genomic site could be taken as evidence of 
convergent evolution, in which multiple indi- 
vidual mutational events at that base-pair site 
are all positively selected for during tumour 
growth. Instead, Aitken and colleagues’ find- 
ings indicate that recurrent mutations could 
result from lesion-bearing DNA strands being 
used as templates for DNA replication over 
multiple rounds of cell division. 

Intriguingly, strand tracing also provides 
a window on the selection of mutations 
associated with cancer. When a cell divides, 
a daughter cell should inherit, at random, 
either DNA strand. However, when the 
authors tracked the prevalence of sequences 
corresponding to inheritance of the paren- 
tal Watson or Crick strand of a particular 
chromosome, they noticed that the tumours 
contained one of these two strands more often
than would be expected by chance.

The authors’ explanation for such preferen- 
tial strand retention is that it occurred because 
the retained strand contained a diethyl- 
nitrosamine-induced mutation in a gene that 
is important for tumour growth. Aitken et al. 
identified three potential tumour-promoting 
genes in this way, all of which are known to be 
crucial for the growth of liver tumours. Strand- 
by-strand analysis might be an unexpectedly 
useful tool for probing the tumour-promoting 
contribution of non-protein-coding regions of 
the genome, because selection can be detected 
without needing to know the background 
mutation rate — the problem of determining 
this rate has posed a challenge for methods 
previously used to study these regions.
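
Because random inheritance predicts that each parental strand is retained with probability one-half, the test for preferential retention needs no estimate of the background mutation rate. One way to sketch such a test is an exact two-sided binomial test, shown below. The choice of test, the function name and the tumour counts are assumptions made for illustration and are not taken from the study.

```python
from math import comb


def two_sided_binomial_p(k, n, p=0.5):
    """Exact two-sided binomial p-value for k successes in n trials.

    Sums the probabilities of all outcomes that are no more likely than
    the observed one under the null hypothesis of success probability p.
    """
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = pmf[k] * (1 + 1e-9)  # small tolerance for float comparison
    return min(1.0, sum(x for x in pmf if x <= observed))


# Hypothetical counts: of 30 tumours, 22 retained the Crick strand
# of a given chromosome.
p_value = two_sided_binomial_p(22, 30)
print(f"p = {p_value:.4f}")
```

A small p-value would indicate that one strand is retained more often than chance allows, which is the signal the authors interpret as selection.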

It might be expected that because there is 
strand-biased prevalence of mutations cor- 
responding to diethylnitrosamine-induced 
lesions, a chromosome should be enriched
for T-to-N mutations along the entire length
of its DNA strand. Instead, the authors found
that, for a single chromosome, the enrichment
of such mutations sometimes switched over
to the other strand (and could be observed as
A-to-N mutations). They propose that this provides
evidence of sites of a DNA-repair process
called homologous recombination, in which 
DNA strands from a chromosome exchange 
with identical DNA sequences in a cell that is 
gearing up to divide, during an event called 
sister-chromatid exchange. 
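
Locating such switch points can be sketched as a scan along the chromosome for places where the dominant mutation class flips between T-to-N and A-to-N. The following is a minimal illustration with a fixed window of consecutive mutations. The window size, function name and example positions are hypothetical, and a real analysis would need a statistically principled segmentation.

```python
def strand_switches(mutations, window=4):
    """Return positions where the dominant mutated reference base flips.

    mutations is a list of (position, reference base) tuples sorted by
    position, with reference base 'T' or 'A'. The list is cut into
    non-overlapping windows of `window` mutations. Windows with a tie are
    skipped. A switch is reported at the start of the first window whose
    majority differs from that of the previous called window.
    """
    calls = []
    for start in range(0, len(mutations) - window + 1, window):
        chunk = mutations[start:start + window]
        t_count = sum(1 for _, ref in chunk if ref == "T")
        a_count = sum(1 for _, ref in chunk if ref == "A")
        if t_count == a_count:
            continue  # ambiguous window, no call
        calls.append((chunk[0][0], "T" if t_count > a_count else "A"))
    return [
        pos
        for (_, previous), (pos, current) in zip(calls, calls[1:])
        if previous != current
    ]


# Hypothetical mutations along one chromosome.
example = [
    (100, "T"), (200, "T"), (300, "T"), (400, "A"),
    (500, "A"), (600, "A"), (700, "A"), (800, "T"),
]
print(strand_switches(example))
```

Each reported position marks a candidate exchange site, bounded in resolution by the spacing of the mutations themselves.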

Sister-chromatid exchange is usually an 
elusive process to monitor because it involves, 
in theory, an exchange between two identical 
DNA sequences. However, because Aitken 
and colleagues could track individual strands 
through mutation patterns, they could detect 
evidence of such events. Interestingly, a higher 
frequency of these exchange events tracked 
with a higher diethylnitrosamine-induced 
mutation burden, suggesting that diethyl- 
nitrosamine-induced damage to DNA might 
prompt sister-chromatid exchange. Aitken
et al. report that presumptive exchange events
tended to occur in regions of the genome that 
were associated with lower-than-average gene 
expression and later replication, compared 
with other regions, during the cell cycle. 
Interestingly, these features are hallmarks 
of ‘fragile sites’ — genomic regions suscep- 
tible to damage during DNA replication that 
are known to be prone to sister-chromatid 
exchange.

The most tantalizing question raised by 
this study is whether the DNA strand (the 
original Watson or Crick strand) in which 
damage first occurred can be tracked in human
cancers. People are usually exposed to muta- 
tion-causing agents, such as those in cigarette 
smoke, for long periods of time, making it
probable that DNA lesions are continually
being induced. Consequently, the mutational 
signal arising from an individual DNA strand 
would probably be obscured. However, when 
Aitken and colleagues assessed data already 
available from human cancers, they found 
that, in rare cases of sudden, acute exposure 
to a mutagenic agent, most clearly observed
for aristolochic acid exposure (which leads to 
liver, kidney and bile-duct cancer), mutational 
signals could be traced back to identify the 
strand originally damaged. 

Chemotherapy also provides an example of 
acute exposure to mutation-causing agents, 
and can give rise to distinctive mutational 
signatures. It will be interesting to see what can
be learnt by applying Aitken and colleagues’ 
methods to analysing chemotherapy-treated 
cancers. Other examples of exposure to muta- 
tion-causing agents, such as acute radiation 
exposure or sunburn, might also be worth 
analysing using a strand-by-strand approach. 
Furthermore, such analysis could clarify the 
timeline of exposure to a mutation-causing 
agent: a single, large exposure might gener- 
ate a mutational signal that could be assigned 
to individual DNA strands, whereas repeated 
exposures might cause a progressively less 
distinct signal. Similarly, analysing individual 
strands might provide insight into the rate of 
lesion-repair processes, and offer a new means 
of studying defects in DNA-repair processes 
in cancer, such as defective mismatch repair. 

Tracking individual DNA strands is a 
reductive approach offering a powerful way 
to study DNA replication and repair processes 
that have been challenging to observe. Aitken 
and colleagues’ study shows the potential of 
this method for tackling the complexities of 
cancer — it seems that the maxim that the 
whole is greater than the sum of the parts does 
not apply to individual DNA strands. 


The discovery of a radioactively powered kilonova associated with the binary neutron-star merger GW170817 remains the only confirmed electromagnetic counterpart to a gravitational-wave event¹,². Observations of the late-time electromagnetic emission, however, do not agree with the expectations from standard neutron-star merger models. Although the large measured ejecta mass³,⁴ could be explained by a progenitor system that is asymmetric in terms of the stellar component masses (that is, with a mass ratio q of 0.7 to 0.8)⁵, the known Galactic population of merging double neutron-star systems (that is, those that will coalesce within billions of years or less) has until now consisted only of nearly equal-mass (q > 0.9) binaries⁶. The pulsar PSR J1913+1102 is a double system in a five-hour, low-eccentricity (0.09) orbit, with an orbital separation of 1.8 solar radii⁷, and the two neutron stars are predicted to coalesce in $470^{+12}_{-11}$ million years owing to gravitational-wave emission. Here we report that the masses of the pulsar and the companion neutron star, as measured by a dedicated pulsar timing campaign, are 1.62 ± 0.03 and 1.27 ± 0.03 solar masses, respectively. With a measured mass ratio of q = 0.78 ± 0.03, this is the most asymmetric merging system reported so far. On the basis of this detection, our population synthesis analysis implies that such asymmetric binaries represent between 2 and 30 per cent (90 per cent confidence) of the total population of merging binaries. The coalescence of a member of this population offers a possible explanation for the anomalous properties of GW170817, including the observed kilonova emission from that event.


Since its discovery⁷ in 2012, we have been regularly monitoring the double neutron star (DNS) PSR J1913+1102 with the Arecibo radio telescope. Our observations have used the Mock Spectrometer and the Puerto Rico Ultimate Pulsar Processing Instrument (PUPPI) to coherently remove dispersive smearing from the pulsar signal, caused by the interstellar free-electron plasma along the line of sight to the pulsar. We analysed data from this pulsar using standard pulse timing techniques (see Methods).

With a spin period of 27 ms, PSR J1913+1102 was probably the first-formed neutron star in this binary system; it subsequently gained angular momentum via accretion of matter from the progenitor to the second neutron star. The timing of the pulsar has allowed a precise measurement of the rate of advance of the periastron, which is $\dot{\omega} = (5.6501 \pm 0.0007)^{\circ}\,\mathrm{yr}^{-1}$. In addition, we have now determined two more post-Keplerian parameters: the first is the Einstein delay ($\gamma = 0.471 \pm 0.015$ ms), which describes the effect of gravitational redshift and relativistic time dilation due to the varying orbital velocity and proximity of the neutron stars to one another during their orbits. The second is the decay of the orbital period caused by the emission of gravitational waves ($\dot{P}_{\rm b} = (-4.8 \pm 0.3) \times 10^{-13}\,\mathrm{s\,s^{-1}}$).
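How two post-Keplerian parameters fix both masses can be sketched numerically. The block below is a minimal illustration, not the DDGR timing fit used in this work: it assumes the standard general-relativistic expressions $\dot{\omega} = 3 n^{5/3} (T_\odot M)^{2/3} / (1 - e^2)$ and $\gamma = e\, n^{-1/3} T_\odot^{2/3} M^{-4/3} m_{\rm c} (M + m_{\rm c})$, with $n = 2\pi/P_{\rm b}$ and $T_\odot = G M_\odot / c^3$, and takes the orbital period and eccentricity from Table 1. All function and variable names are hypothetical.

```python
import math

T_SUN = 4.925490947e-6            # G * M_sun / c^3 in seconds
SECONDS_PER_DAY = 86400.0
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY


def orbital_frequency(pb_days):
    """Orbital angular frequency n = 2*pi / P_b in rad/s."""
    return 2.0 * math.pi / (pb_days * SECONDS_PER_DAY)


def total_mass_from_omdot(omdot_deg_per_yr, pb_days, ecc):
    """Total mass (solar masses) implied by the periastron advance in GR."""
    omdot = math.radians(omdot_deg_per_yr) / SECONDS_PER_YEAR  # rad/s
    n = orbital_frequency(pb_days)
    # Invert omdot = 3 n^(5/3) (T_sun M)^(2/3) / (1 - e^2) for M.
    t_sun_m_two_thirds = omdot * (1.0 - ecc**2) / (3.0 * n ** (5.0 / 3.0))
    return t_sun_m_two_thirds**1.5 / T_SUN


def companion_mass_from_gamma(gamma_s, pb_days, ecc, m_total):
    """Companion mass (solar masses) implied by the Einstein delay in GR."""
    n = orbital_frequency(pb_days)
    scale = ecc * n ** (-1.0 / 3.0) * T_SUN ** (2.0 / 3.0) * m_total ** (-4.0 / 3.0)
    rhs = gamma_s / scale  # equals m_c * (M + m_c)
    # Positive root of m_c^2 + M m_c - rhs = 0.
    return 0.5 * (-m_total + math.sqrt(m_total**2 + 4.0 * rhs))


# Values quoted in the text and in Table 1.
PB_DAYS, ECC = 0.2062523345, 0.089531
m_total = total_mass_from_omdot(5.6501, PB_DAYS, ECC)
m_c = companion_mass_from_gamma(0.471e-3, PB_DAYS, ECC, m_total)
m_p = m_total - m_c
q = m_c / m_p
print(f"M = {m_total:.4f}, m_p = {m_p:.2f}, m_c = {m_c:.2f}, q = {q:.2f}")
```

Under these assumptions the central values come out close to the quoted total mass, component masses and mass ratio; the sketch propagates no uncertainties.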

In Fig. 1, we show the general-relativistic mass constraints corresponding to each measured post-Keplerian parameter. From $\dot{\omega}$, the total system mass is $(2.8887 \pm 0.0006)\,M_\odot$ ($M_\odot$, solar mass), making PSR J1913+1102 the most massive among known DNS systems (by a 2% margin). By combining $\dot{\omega}$ and $\gamma$, we obtain the individual neutron-star masses $m_{\rm p} = (1.62 \pm 0.03)\,M_\odot$ and $m_{\rm c} = (1.27 \pm 0.03)\,M_\odot$ (unless otherwise stated, uncertainties denote 68% confidence) for the pulsar and the companion, respectively, which give a mass ratio of $q = m_{\rm c}/m_{\rm p} = 0.78 \pm 0.03$. The observed $\dot{P}_{\rm b}$ is consistent with the general relativity prediction for these neutron-star masses; apart from confirming them, this effect provides a unique test of alternative gravitational theories that will be reported elsewhere (P.C.C.F. et al., manuscript in preparation). Table 1 summarizes the best-fit model parameters for the PSR J1913+1102 system.

Fig. 1 | Pulsar mass-companion mass diagram for the PSR J1913+1102 system. Shaded regions bounded by solid curves represent 1σ mass constraints from each measured post-Keplerian parameter, derived in the context of general relativity. These are: the orbital precession rate ($\dot{\omega}$), the time dilation/gravitational redshift ($\gamma$) and the rate of orbital decay ($\dot{P}_{\rm b}$). The inset shows a zoom-in of the dotted square region in the main plot, with the 3σ confidence region for the mass measurements shaded in red. The two most precisely measured parameters allow us to determine the individual masses of this system. Each additional post-Keplerian parameter measurement provides an independent consistency test of the predictions of general relativity.

Table 1 | Measured and derived parameters for PSR J1913+1102

Parameter name                                            Value
Reference epoch (MJD)                                     57,504.0
Observing time span (MJD)                                 56,072-58,747
Number of arrival time measurements                       2,541
Solar System ephemeris used                               DE436
Root-mean-square timing residual                          56 µs
Reduced χ² of timing fit                                  1.01
Right ascension, α (J2000)                                19 h 13 min 29.05365(9) s
Declination, δ (J2000)                                    11° 02' 05.7045(22)"
Proper motion in α                                        -3.0(5) mas yr⁻¹
Proper motion in δ                                        -8.7(1.0) mas yr⁻¹
Pulsar spin period, P                                     27.2850068680286(19) ms
Spin period derivative, Ṗ                                 1.5672(7) × 10⁻¹⁹ s s⁻¹
Dispersion measure, DM                                    339.026(3) pc cm⁻³
Orbital period, P_b                                       0.2062523345(2) d
Projected semi-major axis of the pulsar's orbit, x        1.754635(5) light s
Orbital eccentricity, e                                   0.089531(2)
Longitude of periastron, ω                                283.7898(19)°
Epoch of periastron passage, T_0 (MJD)                    57,504.5314530(10)
Total system mass, M                                      2.8887(6) M⊙
Companion mass, m_c                                       1.27(3) M⊙
Rate of periastron advance, ω̇                             5.6501(7)° yr⁻¹
Einstein delay, γ                                         0.000471(15) s
Orbital period decay rate, Ṗ_b                            -4.8(3) × 10⁻¹³ s s⁻¹
Pulsar mass, m_p                                          1.62(3) M⊙
Mass ratio, q                                             0.78(3)
Orbital inclination angle, i                              55.3°
Dispersion-derived distance, d                            7.14 kpc
Two-dimensional systemic peculiar velocity                100 ± 70 km s⁻¹

Values in parentheses represent the 1σ (68% confidence) uncertainty on the last quoted digit. Unless otherwise noted, measured parameters were determined using the DDGR timing […] Julian date. […] Deruelle timing model. Distance is derived using a model of the Galactic ionized electron density⁴⁶, with estimated uncertainty of 20%. […] its derivative.

Fig. 2 | Probability density of the population of PSR J1913+1102-like DNS systems in the Galaxy, as a fraction of the total number of DNSs that will merge within a Hubble time. We find this fraction to be 0.11, with a 90% confidence interval of roughly 0.02 to 0.30 (vertical dashed lines). The quoted value is the median of the distribution, shown on the plot as a solid vertical line, and the peak value of 0.06 is represented by a dotted vertical line. This implies that roughly 1 in 10 merging DNS systems are likely to have asymmetric component masses.

PSR J1913+1102 is part of a population of several very compact DNS binary systems with moderate orbital eccentricities and low proper motions (for example, PSRs J0737-3039A/B⁸, J1756-2251⁹ and J1946+2052¹⁰). These imply an evolutionary path in which the […] helium star progenitor having undergone a supernova with very little mass loss and low natal kick⁶,¹¹, owing to either a rapid iron core collapse event or electron capture onto an oxygen-neon-magnesium core¹²,¹³.


Either of these scenarios leads to a low-mass ($<1.3\,M_\odot$) neutron star¹⁴, which is confirmed by our measurements. Furthermore, we estimate the tangential component of the peculiar velocity of the system to be 100 ± 70 km s⁻¹ (95% confidence); although the uncertainty is still large, this hints at a relatively low kick velocity arising from the second supernova compared to the overall pulsar population.

The PSR J1913+1102 mass ratio makes it the most asymmetric among the known DNS binaries that are expected to merge within a Hubble time, which otherwise have q ≳ 0.9. Considering all known DNS systems, the only one having a similar mass asymmetry is PSR J0453+1559, with q = 0.753 ± 0.005; however, its orbital period of 4.07 days implies a coalescence time about 100 times greater than the age of the Universe. By contrast, PSR J1913+1102 has an expected time to coalescence of $470^{+12}_{-11}$ Myr, which we determined from the orbital decay rate and other measured orbital elements.

Fig. 3 | Post-fit timing residuals for PSR J1913+1102. These are obtained after including all best-fit parameters in the Damour and Deruelle General Relativity (DDGR) model ephemeris for this pulsar. Each contributing instrument is represented by different colours: Mock spectrometer centred at 1,300 MHz (orange) and at 1,450 MHz (red); PUPPI centred at 1,400 MHz in incoherent mode (purple); PUPPI centred at 1,400 MHz in coherent-fold mode (yellow); and PUPPI centred at 2,350 MHz in coherent-fold mode (cyan). The latter provided a substantial improvement in data quality, as evidenced by the reduction in the weighted root-mean-square timing residuals to 48 µs, down from 72 µs at 1,400 MHz incoherent mode. The error bars shown reflect the 1σ (68%) uncertainties of each data point. Axes: residuals (ms) against year.

There are currently nine confirmed compact DNS binaries that are predicted to merge within a Hubble time, for which precise neutron-star mass measurements have been made. We performed a population synthesis analysis for these DNS systems, including PSR J1913+1102, using their individual properties and known masses (see also Methods). We found that PSR J1913+1102-like binaries represent about 11% of merging DNS systems (Figs. 2, 3), where the quoted value is the median and the 90% confidence interval spans roughly 2% to 30%. This therefore establishes the existence of a population of asymmetric DNS systems that is sizable enough to potentially lead to the discovery of several corresponding merger events by ground-based gravitational-wave observatories such as LIGO/Virgo. As such, the discovery of PSR J1913+1102 provides evidence for the need to account for asymmetric coalescing DNS binaries (approximately one-tenth of events) in interpreting merger scenarios and the underpinning physics.

Observations of the electromagnetic counterparts to the GW170817 event have largely been related to a relatively large amount of ejecta due to the preceding DNS merger, with masses of the order of $0.05\,M_\odot$ (refs. 4, 16). This is at odds with standard models of DNS coalescence, which typically predict smaller ejecta mass by at least a factor of 5, primarily on the basis of the assumption of equal-mass (or near-equal-mass) progenitor DNS binary systems. Until recently, this has been a reasonable assumption given the known DNS population. It is plausible that the anomalously massive ejecta inferred from the observed late-time emission in the case of GW170817 may be explained with an equal-mass system, particularly if one includes a secular component to the ejecta. The merger may also be explained by the invocation of various models to describe the observations. These include an off-axis jet from a short γ-ray burst²²,²³; a mildly relativistic wide-angle outflow that interacts with the dynamic ejecta²⁴,²⁵; a hypermassive neutron-star remnant of the merger acting as a spin-down energy source²⁶,²⁷; and a hierarchical triple system in which a disk is formed from a Roche lobe-filling outer star²⁸.

By contrast, numerical simulations have shown that high-asymmetry (0.65 ≲ q ≲ 0.85) systems will naturally produce larger tidal distortions during the merger phase and result in larger-mass, and therefore brighter, disks than those of roughly equal-mass systems²⁹⁻³². The resulting tidal effects consistently produce sufficient neutron-rich ejecta to power a kilonova and result in an enhancement of r-process material. A sufficiently unequal-mass binary that will merge within a Hubble time may therefore be responsible for events such as GW170817, particularly in the slow, $\sim 0.04\,M_\odot$ red component seen in the latter⁴. A substantial population of asymmetric DNSs such as PSR J1913+1102 would therefore lead to an enhanced detection rate of bright kilonovae. The electromagnetic counterparts to such events are therefore particularly important for understanding the Galactic heavy-element abundance.
Enhanced pre-coalescence tidal distortions due to asymmetric mergers may also allow certain neutron-star equation-of-state models to be ruled out through study of the gravitational-wave waveform³⁵⁻³⁸. Additionally, gravitational waves from merging DNSs have recently been used as distance indicators (so-called ‘standard sirens’), enabling an independent probe of the Hubble constant, $H_0$, when combined with radial-velocity measurements of the electromagnetic counterparts³⁹. Future asymmetric DNS mergers with similar electromagnetic counterparts to those of GW170817 would lead to a considerably more precise determination of $H_0$ (about 15 suitable detections would provide a ~2% measurement⁴⁰) and potentially provide the means by which to resolve the disagreement between $H_0$ as measured from the cosmic microwave background⁴¹ and through local Universe analysis methods⁴².


Online content 


Any methods, additional references, Nature Research reporting sum- 
maries, source data, extended data, supplementary information, 
acknowledgements, peer review information; details of author con- 
tributions and competing interests; and statements of data and code 
availability are available at https://doi.org/10.1038/s41586-020-2439-x. 


1. Abbott, B. et al. GW170817: observation of gravitational waves from a binary neutron star 
inspiral. Phys. Rev. Lett. 119, 161101 (2017). 

2. Abbott, B. P. et al. Multi-messenger observations of a binary neutron star merger. 
Astrophys. J. Lett. 848, 12 (2017). 

3. Abbott, B. P. et al. Estimating the contribution of dynamical ejecta in the kilonova 
associated with GW170817. Astrophys. J. Lett. 850, 39 (2017). 

4. Cowperthwaite, P. S. et al. The electromagnetic counterpart of the binary neutron star 
merger LIGO/Virgo GW170817. II. UV, optical, and near-infrared light curves and 
comparison to kilonova models. Astrophys. J. Lett. 848, 17 (2017). 

5. Pankow, C. On GW170817 and the Galactic binary neutron star population. Astrophys. J. 
866, 60 (2018). 

6. Tauris, T. M. et al. Formation of double neutron star systems. Astrophys. J. 846, 170 (2017). 

7. Lazarus, P. et al. Einstein@Home discovery of a double neutron star binary in the PALFA 
survey. Astrophys. J. 831, 150 (2016). 

8. Kramer, M. et al. Tests of general relativity from timing the double pulsar. Science 314, 
97-102 (2006). 

9. Ferdman, R. D. et al. PSR J1756-2251: a pulsar with a low-mass neutron star companion. 
Mon. Not. R. Astron. Soc. 443, 2183-2196 (2014). 

10. Stovall, K. et al. PALFA discovery of a highly relativistic double neutron star binary. 
Astrophys. J. Lett. 854, 22 (2018). 


Nature | Vol583 | 9 July 2020 | 213 


11. Tauris, T. M. et al. Ultra-stripped type Ic supernovae from close binary evolution. Astrophys. J. Lett. 778, 23 (2013).

12. Miyaji, S., Nomoto, K., Yokoi, K. & Sugimoto, D. Supernova triggered by electron captures. Publ. Astron. Soc. Jpn. 32, 303-329 (1980).

13. Nomoto, K. Evolution of 8-10 solar mass stars toward electron capture supernovae. I. Formation of electron-degenerate O + Ne + Mg cores. Astrophys. J. 277, 791-805 (1984).

14. Podsiadlowski, P. et al. The double pulsar J0737-3039: testing the neutron star equation of state. Mon. Not. R. Astron. Soc. 361, 1243-1249 (2005).

15. Martinez, J. G. et al. Pulsar J0453+1559: a double neutron star system with a large mass asymmetry. Astrophys. J. 812, 143 (2015).

16. Kasen, D., Metzger, B., Barnes, J., Quataert, E. & Ramirez-Ruiz, E. Origin of the heavy elements in binary neutron-star mergers from a gravitational-wave event. Nature 551, 80-84 (2017).

17. Hotokezaka, K. et al. Mass ejection from the merger of binary neutron stars. Phys. Rev. D 87, 024001 (2013).

18. Radice, D. & Dai, L. Multimessenger parameter estimation of GW170817. Eur. Phys. J. A 55, 50 (2019).

19. Siegel, D. M. & Metzger, B. D. Three-dimensional GRMHD simulations of neutrino-cooled accretion disks from neutron star mergers. Astrophys. J. 858, 52 (2018).

20. Fernandez, R., Tchekhovskoy, A., Quataert, E., Foucart, F. & Kasen, D. Long-term GRMHD simulations of neutron star merger accretion discs: implications for electromagnetic counterparts. Mon. Not. R. Astron. Soc. 482, 3373-3393 (2018).

21. Radice, D. et al. Binary neutron star mergers: mass ejection, electromagnetic counterparts and nucleosynthesis. Astrophys. J. 869, 130 (2018).

22. Troja, E. et al. The X-ray counterpart to the gravitational-wave event GW170817. Nature 551, 71-74 (2017).

23. Haggard, D. et al. A deep Chandra X-ray study of neutron star coalescence GW170817. Astrophys. J. 848, L25 (2017).

24. Mooley, K. P. et al. A mildly relativistic wide-angle outflow in the neutron-star merger event GW170817. Nature 554, 207-210 (2018).

25. Piro, A. L. & Kollmeier, J. A. Evidence for cocoon emission from the early light curve of SSS17a. Astrophys. J. 855, 103 (2018).

26. Metzger, B. D., Thompson, T. A. & Quataert, E. A magnetar origin for the kilonova ejecta in GW170817. Astrophys. J. 856, 101 (2018).

27. Li, S.-Z., Liu, L.-D., Yu, Y.-W. & Zhang, B. What powered the optical transient AT2017gfo associated with GW170817? Astrophys. J. Lett. 861, 12 (2018).

28. Chang, P. & Murray, N. GW170817: A neutron star merger in a mass-transferring triple system. Mon. Not. R. Astron. Soc. Lett. 474, 12-16 (2018).

29. Shibata, M. & Taniguchi, K. Merger of binary neutron stars to a black hole: disk mass, short gamma-ray bursts, and quasinormal mode ringing. Phys. Rev. D 73, 064027 (2006).

30. Rezzolla, L., Baiotti, L., Giacomazzo, B., Link, D. & Font, J. A. Accurate evolutions of unequal-mass neutron-star binaries: properties of the torus and short GRB engines. Class. Quantum Gravity 27, 114105 (2010).


31. Dietrich, T. & Ujevic, M. Modeling dynamical ejecta from binary neutron star mergers and implications for electromagnetic counterparts. Class. Quantum Gravity 34, 105014 (2017).

32. Lehner, L. et al. Unequal mass binary neutron star mergers and multimessenger signals. Class. Quantum Gravity 33, 184002 (2016).

33. Tanvir, N. R. et al. A ‘kilonova’ associated with the short-duration γ-ray burst GRB 130603B. Nature 500, 547-549 (2013).

34. Just, O., Bauswein, A., Pulpillo, R. A., Goriely, S. & Janka, H.-T. Comprehensive nucleosynthesis analysis for ejecta of compact binary mergers. Mon. Not. R. Astron. Soc. 448, 541-567 (2015).

35. Read, J. S. et al. Measuring the neutron star equation of state with gravitational wave observations. Phys. Rev. D 79, 124033 (2009).

36. Lackey, B. D. & Wade, L. Reconstructing the neutron-star equation of state with gravitational-wave detectors from a realistic population of inspiralling binary neutron stars. Phys. Rev. D 91, 043002 (2015).

37. Agathos, M. et al. Constraining the neutron star equation of state with gravitational wave signals from coalescing binary neutron stars. Phys. Rev. D 92, 023012 (2015).

38. Abbott, B. P. et al. GW170817: measurements of neutron star radii and equation of state. Phys. Rev. Lett. 121, 161101 (2018).

39. The LIGO Scientific Collaboration et al. A gravitational-wave standard siren measurement of the Hubble constant. Nature 551, 85-88 (2017).

40. Hotokezaka, K. et al. A Hubble constant measurement from superluminal motion of the jet in GW170817. Nat. Astron. 3, 940-944 (2019).

41. Planck Collaboration. Planck 2018 results. VI. Cosmological parameters. Preprint at https://arxiv.org/abs/1807.06209 (2018).

42. Riess, A. G., Casertano, S., Yuan, W., Macri, L. M. & Scolnic, D. Large Magellanic Cloud Cepheid standards provide a 1% foundation for the determination of the Hubble constant and stronger evidence for physics beyond ΛCDM. Astrophys. J. 876, 85 (2019).

43. Damour, T. & Deruelle, N. General relativistic celestial mechanics of binary systems. I. The post-Newtonian motion. Ann. I.H.P. Phys. Theor. 43, 107-132 (1985).

44. Damour, T. & Deruelle, N. General relativistic celestial mechanics of binary systems. II. The post-Newtonian timing formula. Ann. I.H.P. Phys. Theor. 44, 263-292 (1986).

45. Damour, T. & Taylor, J. H. Strong-field tests of relativistic gravity and binary pulsars. Phys. Rev. D 45, 1840-1868 (1992).

46. Yao, J. M., Manchester, R. N. & Wang, N. A new electron-density model for estimation of pulsar and FRB distances. Astrophys. J. 835, 29 (2017).

47. Lorimer, D. R. & Kramer, M. Handbook of Pulsar Astronomy Vol. 4 (Cambridge Univ. Press, 2004).


Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in 
published maps and institutional affiliations. 


© The Author(s), under exclusive licence to Springer Nature Limited 2020 


Methods 


Timing analysis 

PSR J1913+1102 was discovered in 2012 by Einstein@Home in data from the PALFA survey⁷, which uses the William E. Gordon 305-m radio telescope at the Arecibo Observatory in Puerto Rico to search for pulsars within 5° of the Galactic plane. Since the discovery, we have been using the Arecibo Observatory to regularly monitor this pulsar with the Mock Spectrometer and the PUPPI pulsar backend system. This has been done both as a dedicated follow-up campaign for PSR J1913+1102 and regularly as a test source before PALFA observing sessions.

The times of arrival of these pulses were measured by cross-correlating each pulse profile with a noise-free representative template profile, from which we calculated a phase shift and applied it to the observed time stamp of the data profile. The uncertainty in each resulting pulse arrival time was determined by adopting the error in the calculated phase shift from the aforementioned correlation procedure. This process was carried out using the PSRCHIVE suite of analysis tools. Corrections between terrestrial time and the observatory clock were applied using data from Global Positioning System satellites and the Bureau International des Poids et Mesures. Our model also included input from the Jet Propulsion Laboratory DE436 Solar System ephemeris, in order to convert measured arrival times to the reference frame of the Solar System barycentre, by taking into account the motion of Earth.
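The template-matching step can be illustrated with a toy example. The sketch below is not the PSRCHIVE procedure used in this work; it assumes a simple time-domain circular cross-correlation with a parabolic refinement of the peak, applied to a synthetic Gaussian profile. The function names and the 64-bin profile are hypothetical.

```python
import math


def circular_ccf(profile, template):
    """Circular cross-correlation of a folded profile with a template."""
    n = len(template)
    return [sum(profile[(i + lag) % n] * template[i] for i in range(n))
            for lag in range(n)]


def phase_shift(profile, template):
    """Shift of the profile relative to the template, as a fraction of a turn."""
    n = len(template)
    ccf = circular_ccf(profile, template)
    peak = max(range(n), key=ccf.__getitem__)
    # Parabolic interpolation around the peak gives a sub-bin estimate.
    left, mid, right = ccf[(peak - 1) % n], ccf[peak], ccf[(peak + 1) % n]
    denom = left - 2.0 * mid + right
    offset = 0.5 * (left - right) / denom if denom != 0.0 else 0.0
    return ((peak + offset) % n) / n


def time_of_arrival(start_time_s, period_s, profile, template):
    """Apply the measured phase shift to the time stamp of the profile."""
    return start_time_s + phase_shift(profile, template) * period_s


# Synthetic test case: a Gaussian template and a copy delayed by 7 bins.
N_BINS = 64
template = [math.exp(-0.5 * ((i - 20) / 3.0) ** 2) for i in range(N_BINS)]
profile = [template[(i - 7) % N_BINS] for i in range(N_BINS)]

shift = phase_shift(profile, template)
toa = time_of_arrival(0.0, 27.2850068680286e-3, profile, template)
print(f"phase shift = {shift:.6f} turns, TOA offset = {toa * 1e3:.4f} ms")
```

Production timing codes usually fit the shift in the Fourier domain, which also yields the shift uncertainty that becomes the arrival-time error.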

These barycentred times of arrival were then compared to a predictive model of their expected arrival at Earth using the TEMPO pulsar timing software package (https://github.com/nanograv/tempo). Every rotation of the neutron star was enumerated relative to a reference observing epoch by accounting in our model for intrinsic pulsar properties, such as the rotation frequency and its spin-down rate, as well as its sky position and proper motion. We also addressed potential arrival-time delays due to the frequency-dependent refractive effect of the ionized interstellar medium by including the dispersion measure (DM) in our timing model, which is the integrated column density of free electrons along the line of sight between Earth and the pulsar. The relatively large value DM = 339.026 ± 0.005 pc cm⁻³ for this pulsar explains why initial observations, primarily taken at an observing frequency centred at 1.4 GHz, displayed evidence of interstellar scattering that resulted in considerable smearing of the observed pulse shape due to multi-path propagation of the signal on its way to Earth. This led to increased systematic uncertainties in the derived pulsar parameters, and we therefore switched to observations with the higher-frequency S-band Low receiver (centred at 2.4 GHz with a bandwidth of 800 MHz) to reduce these effects.
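The size of the dispersive delay at the two observing bands can be estimated with the standard cold-plasma relation, $\Delta t \approx 4.148808\ \mathrm{ms} \times \mathrm{DM} \times (\nu/\mathrm{GHz})^{-2}$ with DM in pc cm⁻³. The sketch below is illustrative only and is not part of the timing pipeline; the function name is hypothetical, and it covers dispersion alone, not the scattering discussed above (which falls off more steeply with frequency).

```python
# Dispersion constant in ms GHz^2 cm^3 pc^-1 (standard pulsar-astronomy value).
K_DM_MS = 4.148808


def dispersion_delay_ms(dm_pc_cm3, freq_ghz):
    """Delay relative to infinite frequency, in milliseconds."""
    return K_DM_MS * dm_pc_cm3 / freq_ghz**2


DM = 339.026  # pc cm^-3, value quoted in the text
delay_l_band = dispersion_delay_ms(DM, 1.4)
delay_s_band = dispersion_delay_ms(DM, 2.4)
print(f"1.4 GHz: {delay_l_band:.1f} ms, 2.4 GHz: {delay_s_band:.1f} ms")
```

Both delays are many times the 27-ms spin period, which is why the dispersive smearing across each band has to be removed before the pulse shape can be measured.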

Along with these model parameters, our timing data resulted in an important measurement of the Keplerian orbital elements of the PSR J1913+1102 system, as well as several post-Keplerian parameters (these are quoted with 68% confidence-level uncertainties in Table 1). The latter are a theory-independent set of parameters that characterize perturbations on the Keplerian description of the orbit in the relativistic regime. As described in the main text, we have now improved the measurement precision of the orbital precession rate and determined the Einstein delay. From this, we were able to constrain the individual masses of the neutron stars in this system by assuming that general relativity is the correct theory of gravity, as encapsulated in the DDGR timing model. We have also made a precise determination of the orbital decay rate due to the emission of gravitational waves, which serve to remove orbital energy from the system over time. We expect the precision of the orbital decay to improve rapidly over time $t$ (scaling as $t^{-5/2}$) with further observations. It should also be noted that sources of kinematic biases can be introduced into the measured orbital decay and pulsar spin-down rates from apparent acceleration of the pulsar due to its tangential motion (that is, the Shklovskii effect) and the Galactic potential. We find the total proper motion of this pulsar to be 9.3 ± 0.9 mas yr⁻¹, within 3σ of what would be expected if the PSR J1913+1102 system were in the local standard of rest (6.50 mas yr⁻¹); this assumes a distance of 7.14 kpc. This distance is estimated from the measured value of DM, using a model of the Galactic free-electron density distribution⁴⁶. The total kinematic bias to the observed orbital decay corresponds to approximately one-third of the uncertainty in the orbital decay measurement. We are therefore confident that our measurements are consistent with intrinsic parameter values for the pulsar at the current level of uncertainty.
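Both the expected orbital decay and the time to coalescence follow from the measured masses and orbit. The sketch below is a rough cross-check rather than the calculation performed in this work: it assumes the standard quadrupole-formula expressions of Peters (1964) for a point-mass binary, uses the central mass values from the main text, and integrates the eccentricity evolution with Simpson's rule. Function names are hypothetical.

```python
import math

T_SUN = 4.925490947e-6                 # G * M_sun / c^3 in seconds
SECONDS_PER_MYR = 365.25 * 86400.0 * 1.0e6


def pbdot_gr(pb_s, ecc, m_p, m_c):
    """Orbital period derivative from gravitational-wave emission (dimensionless)."""
    n = 2.0 * math.pi / pb_s
    f_e = (1.0 + 73.0 / 24.0 * ecc**2 + 37.0 / 96.0 * ecc**4) / (1.0 - ecc**2) ** 3.5
    return (-(192.0 * math.pi / 5.0) * (n * T_SUN) ** (5.0 / 3.0)
            * f_e * m_p * m_c / (m_p + m_c) ** (1.0 / 3.0))


def merger_time_s(pb_s, ecc, m_p, m_c, steps=2000):
    """Time to coalescence from Peters' integral; lengths are in light-seconds."""
    m_tot = m_p + m_c
    n = 2.0 * math.pi / pb_s
    a = (T_SUN * m_tot / n**2) ** (1.0 / 3.0)           # Kepler's third law
    beta = 64.0 / 5.0 * T_SUN**3 * m_p * m_c * m_tot
    c0 = (a * (1.0 - ecc**2) * ecc ** (-12.0 / 19.0)
          * (1.0 + 121.0 / 304.0 * ecc**2) ** (-870.0 / 2299.0))

    def integrand(e):
        return (e ** (29.0 / 19.0) * (1.0 + 121.0 / 304.0 * e**2) ** (1181.0 / 2299.0)
                / (1.0 - e**2) ** 1.5)

    # Composite Simpson's rule on [0, ecc]; steps must be even.
    h = ecc / steps
    total = integrand(0.0) + integrand(ecc)
    for k in range(1, steps):
        total += (4.0 if k % 2 else 2.0) * integrand(k * h)
    return 12.0 / 19.0 * c0**4 / beta * (total * h / 3.0)


PB_S = 0.2062523345 * 86400.0
ECC, M_P, M_C = 0.089531, 1.62, 1.27
pbdot = pbdot_gr(PB_S, ECC, M_P, M_C)
t_merge_myr = merger_time_s(PB_S, ECC, M_P, M_C) / SECONDS_PER_MYR
print(f"Pb-dot (GR) = {pbdot:.2e}, merger time = {t_merge_myr:.0f} Myr")
```

With these inputs the predicted decay rate lies within the quoted uncertainty of the measured value, and the merger time comes out near 470 Myr.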

Once we apply our model to the dataset, we produce post-fit timing residuals, that is, the difference between the predicted and observed pulse arrival times (Fig. 3). The timing precision achieved by this fit to our pulse arrival time data is characterized by the root-mean-square (r.m.s.) of the post-fit timing residuals. Our analysis of the PSR J1913+1102 dataset resulted in r.m.s. residuals of 56.1 µs, consistent with the typical measured uncertainty in the observed pulse arrival times. We achieved a reduced χ² (that is, χ² divided by the number of degrees of freedom) of 1.01 for our fit, reaffirming the success of our timing model in describing the system, and implying that the timing residuals can be well represented by white Gaussian noise, as can be seen in Fig. 3.
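The two goodness-of-fit statistics quoted here are straightforward to compute from the residuals and their uncertainties. The sketch below is a generic illustration with made-up numbers, not the TEMPO output for this pulsar; the function names and the toy residuals are hypothetical.

```python
import math


def reduced_chi2(residuals, errors, n_params):
    """Chi-squared divided by the number of degrees of freedom."""
    chi2 = sum((r / s) ** 2 for r, s in zip(residuals, errors))
    dof = len(residuals) - n_params
    return chi2 / dof


def weighted_rms(residuals, errors):
    """Root-mean-square of the residuals, weighted by inverse variance."""
    weights = [1.0 / s**2 for s in errors]
    w_sum = sum(weights)
    mean = sum(w * r for w, r in zip(weights, residuals)) / w_sum
    var = sum(w * (r - mean) ** 2 for w, r in zip(weights, residuals)) / w_sum
    return math.sqrt(var)


# Toy residuals and 1-sigma uncertainties in microseconds (illustrative only).
toy_residuals = [1.0, -1.0, 2.0, -2.0, 1.0, -1.0]
toy_errors = [1.0, 1.0, 2.0, 2.0, 1.0, 1.0]
chi2_red = reduced_chi2(toy_residuals, toy_errors, n_params=2)
wrms = weighted_rms(toy_residuals, toy_errors)
print(f"reduced chi2 = {chi2_red:.2f}, weighted rms = {wrms:.3f} us")
```

A reduced χ² close to 1, as reported for the real fit, indicates that the scatter of the residuals matches the arrival-time uncertainties.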


Population synthesis 

Modelling of the merger event that caused GW170817 has mostly relied on a DNS population consisting of roughly equal-mass neutron stars. Although this may be the result of a binary system with pre-merger mass ratio q = 1, the discovery of PSR J1913+1102 highlights the need to consider the effects of an asymmetric DNS merger. We note here for completeness that it is possible that GW170817 was caused by a neutron star-black hole merger. However, the abnormally low mass of the black hole in such a progenitor system would make this an unlikely scenario.

Previous studies simulated the population of DNS binaries from the measured parameters of known systems within a modelled Galactic pulsar population; this was done using the known sensitivities of the pulsar survey in which they were discovered. The modelling must account for selection effects, including the search degradation factor due to orbital acceleration, calculated from a semi-analytical model with the pulsar and companion masses and the system inclination as input. We calculated the probability density of the population of PSR J1913+1102-like DNSs that are beamed towards Earth ($N_{\rm obs,J1913}$) using the more precisely measured orbital properties presented in this work. Assuming a beaming correction fraction for the pulsar of $f_{\rm b} = 4.6$, we derived the probability density of the total population ($N_{\rm tot,J1913}$) of J1913+1102-like DNS systems in the Galaxy: $N_{\rm tot,J1913} = N_{\rm obs,J1913} \times f_{\rm b}$. The mode of the resulting distribution is $N_{\rm tot,J1913} \approx 700$, with quoted uncertainties that represent the 90% confidence interval of the distribution. This is consistent with previous estimates, but has smaller error bars owing to the updated orbital parameters and the addition of a new radio pulsar survey.

Owing to its small orbital period, the PSR J1913+1102 system will merge in 470 Myr, well within one Hubble time. There are eight other known DNS systems in the Galaxy that will also merge within the age of the Universe (which we henceforth refer to as merging DNSs, MDNSs). We obtained the individual probability densities of the population of these MDNSs using results given in previous studies. Assuming that these individual population distributions represent independent continuous random variables, we estimate the total population of MDNS systems in the Galaxy by convolving the individual population probability distributions, resulting in a mode $N_{\rm tot,MDNS} \approx 11.4 \times 10^{3}$. Using our derived probability densities of PSR J1913+1102-like systems together with those of all MDNS systems, we then compute the probability density of PSR J1913+1102-like DNS systems in the Galaxy as a fraction of the MDNS population to be about 11% (90% confidence interval of roughly 2% to 30%), using the median as the quoted value (with the mode of the distribution occurring at 6%); Fig. 2 presents the corresponding probability distribution. This in turn leads to an estimate that roughly one-tenth of detected DNS mergers result from the coalescence of an asymmetric binary system.
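The convolution step relies on a basic property: the distribution of a sum of independent random variables is the convolution of their individual distributions. The sketch below illustrates only that property on small discrete toy distributions; it is not the population synthesis code used in this work, and the function names and numbers are hypothetical.

```python
def convolve_pmf(pmf_a, pmf_b):
    """Distribution of the sum of two independent discrete variables.

    Each input is a list where index k holds the probability of the value k.
    """
    out = [0.0] * (len(pmf_a) + len(pmf_b) - 1)
    for i, p_a in enumerate(pmf_a):
        for j, p_b in enumerate(pmf_b):
            out[i + j] += p_a * p_b
    return out


def total_population_pmf(pmfs):
    """Convolve any number of per-system population distributions."""
    total = [1.0]  # distribution of a sum over zero systems: certainly 0
    for pmf in pmfs:
        total = convolve_pmf(total, pmf)
    return total


def mode_index(pmf):
    """Value at which the distribution peaks."""
    return max(range(len(pmf)), key=pmf.__getitem__)


# Toy example: two systems, each equally likely to contribute 0 or 1 objects.
toy_total = total_population_pmf([[0.5, 0.5], [0.5, 0.5]])
print(toy_total, "mode:", mode_index(toy_total))
```

In a real analysis each input would be a finely sampled population distribution for one known system, and the summed distribution would then be compared with that of the system of interest to obtain a fraction.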
https://github.com/nanograv/tempo (pulsar timing analysis); https://github.com/NihanPol/2018-DNS-merger-rate (population synthesis); https://github.com/rferdman/pypsr (plotting tools).


The recent discovery of correlated insulator states and superconductivity in magic-angle twisted bilayer graphene has enabled the experimental investigation of electronic correlations in tunable flat-band systems realized in twisted van der Waals heterostructures. This novel twist angle degree of freedom and control should be generalizable to other two-dimensional systems, which may exhibit similar correlated physics behaviour, and could enable techniques to tune and control the strength of electron-electron interactions. Here we report a highly tunable correlated system based on small-angle twisted bilayer-bilayer graphene (TBBG), consisting of two rotated sheets of Bernal-stacked bilayer graphene. We find that TBBG exhibits a rich phase diagram, with tunable correlated insulator states that are highly sensitive to both the twist angle and the application of an electric displacement field, the latter reflecting the inherent polarizability of Bernal-stacked bilayer graphene. The correlated insulator states can be switched on and off by the displacement field at all integer electron fillings of the moiré unit cell. The response of these correlated states to magnetic fields suggests evidence of spin-polarized ground states, in stark contrast to magic-angle twisted bilayer graphene. Furthermore, in the regime of lower twist angles, TBBG shows multiple sets of flat bands near charge neutrality, resulting in numerous correlated states corresponding to half-filling of each of these flat bands, all of which are tunable by the displacement field as well. Our results could enable the exploration of twist-angle- and electric-field-controlled correlated phases of matter in multi-flat-band twisted superlattices.


Electronic correlations play a fundamental role in condensed-matter
systems where the bandwidth is comparable to or less than the Coulomb
energy between electrons. These correlation effects often manifest
themselves as intriguing quantum phases of matter, such as ferromagnetism,
superconductivity, Mott insulators or fractional quantum Hall
states. Understanding, predicting and characterizing these correlated
phases is of great interest in modern condensed-matter physics research
and poses challenges to both experimentalists and theorists. Recent
studies of twisted graphene superlattices have provided us with an ideal
tunable platform to investigate electronic correlations in two dimensions.
Tuning the twist angle of two-dimensional (2D) van der Waals
heterostructures to realize novel electronic states, an emerging field
referred to as ‘twistronics’, has enabled physicists to explore a variety
of novel phenomena. When two layers of graphene are twisted by
a specific angle, the phase diagram of the system exhibits correlated
insulator states with similarities to Mott insulator systems, as well as
unconventional superconducting states upon charge doping. These
effects might originate from the many-body interactions between
the electrons when the band structure becomes substantially narrow as
the twist angle approaches the first magic angle θ = 1.1°.

Here we extend the twistronics research on graphene superlattices
to a novel system with electrical displacement field tunability: twisted
bilayer–bilayer graphene (TBBG), which consists of two sheets of
untwisted Bernal-stacked bilayer graphene stacked together at an
angle θ, as illustrated in Fig. 1a. The band structure of bilayer graphene
is highly sensitive to the applied perpendicular electric displacement
field, and therefore provides us with an extra knob to control the
relative strength of electronic correlations in the bands. Similar
to twisted bilayer graphene (TBG), the band structure of TBBG is
flattened near about 1.1° (Fig. 2e–g). For devices with a twist angle
near this value, our experiments show that the correlated insulator
behaviour at n_s/2, n_s/4 and 3n_s/4 can be sensitively turned on and off
by the displacement field, where n_s is the density corresponding to
fully filling one spin- and valley-degenerate superlattice band.
From their response to magnetic fields, all of these correlated states
probably have a spin-polarized nature, with the n_s/2 state having a
g-factor of about 1.5 for parallel fields, close to the bare electron spin
g-factor of 2. In contrast, devices with a smaller twist angle of 0.84°
show multiple displacement-field-tunable correlated states at higher
fillings, consistent with the presence of several sets of correlated flat
bands in the electronic structure. The combination of twist angle,
electric displacement field and magnetic field provides a rich arena
to investigate novel correlated phenomena in the emerging field of
twistronics.
Fig. 1 | Structure and transport characterization of TBBG. a, TBBG consists
of two sheets of Bernal-stacked bilayer graphene twisted at an angle θ.
b, Schematic of a typical TBBG device with top and bottom gates and a Hall-bar
geometry for transport measurements. c–e, Measured longitudinal resistance
R_xx = V_xx/I and low-field Hall coefficient R_H = d/dB(V_xy/I) as functions of carrier
density n in three devices with twist angles θ = 1.23° (c), 1.09° (d) and 0.84° (e).
The vertical dashed lines denote multiples of the superlattice density n_s, where
the peaking of R_xx and sign changing of R_H indicate that the Fermi energy crosses a
band edge of the superlattice bands. f, Resistance of the 1.09° TBBG device


We fabricated high-mobility dual-gated TBBG devices with the
previously reported ‘tear and stack’ method, using exfoliated
Bernal-stacked bilayer graphene instead of monolayer graphene. The
devices presumably have an AB-AB stacking configuration where the
top and bottom bilayers retain the same AB stacking order, in contrast
to the AB-BA structure that was predicted to show topological effects.
We measured the transport properties of six small-angle devices, and
here we focus on three of the devices with twist angles θ = 1.23°, 1.09°
and 0.84° (see Extended Data Fig. 1 for other devices). The samples
are all of high quality, as evident in the Landau fan diagrams, with Hall
mobilities that can exceed 100,000 cm² V⁻¹ s⁻¹, shown in Extended
Data Fig. 2. Figure 1c–e shows the longitudinal resistance R_xx and the
low-field Hall coefficient R_H = dR_xy/dB versus charge density for these
three devices at a temperature of T = 4 K, where B is the magnetic field
perpendicular to the sample. In a superlattice, the electronic band
structure is folded in the mini-Brillouin zone, defined by the moiré
periodicity. Each band in the mini-Brillouin zone can accommodate a
total charge density of n_s = 4/A, where A is the size of the moiré unit cell
and the pre-factor accounts for the spin and valley degeneracies.
The experimental results show a sign change in the Hall coefficient
R_H at each multiple of n_s (vertical dashed lines in Fig. 1c–e), indicating
the switching of hole-like pockets to electron-like pockets, and peaks
in R_xx, indicating the crossing of new band edges (for θ = 0.84°, the
band edges at −n_s and +2n_s may have only small gaps or may even be
semi-metallic, and hence do not exhibit prominent peaks in R_xx). The
sharpness of the peaks confirms that the devices exhibit relatively low
disorder and have well-defined twist angles.
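
As a rough illustration of the relation n_s = 4/A, the sketch below estimates the superlattice density from the twist angle. It assumes the standard small-angle moiré geometry (wavelength a / (2 sin(θ/2)) and a triangular moiré lattice) and a graphene lattice constant of 0.246 nm; neither is stated in the text, and the function names are hypothetical.

```python
import math

A_GRAPHENE_NM = 0.246  # assumed graphene lattice constant in nm (not given in the text)


def moire_wavelength_nm(theta_deg):
    """Moiré period for twist angle theta, assuming lambda = a / (2 sin(theta / 2))."""
    theta = math.radians(theta_deg)
    return A_GRAPHENE_NM / (2.0 * math.sin(theta / 2.0))


def moire_cell_area_nm2(theta_deg):
    """Area of the moiré unit cell of a triangular lattice: (sqrt(3) / 2) * lambda^2."""
    lam = moire_wavelength_nm(theta_deg)
    return (math.sqrt(3.0) / 2.0) * lam ** 2


def superlattice_density_cm2(theta_deg):
    """n_s = 4 / A, converted from nm^-2 to cm^-2 (1 nm^-2 = 1e14 cm^-2)."""
    return 4.0 / moire_cell_area_nm2(theta_deg) * 1.0e14


half_filling_123 = superlattice_density_cm2(1.23) / 2.0
```

For θ = 1.23° this gives a half-filling density close to the value of 1.77 × 10¹² cm⁻² quoted in the caption of Fig. 2, and a moiré period of roughly 11 nm.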

In the θ = 1.23° and θ = 1.09° devices, we observe signatures of
newly formed gaps at n_s/2 when a displacement field D is applied
versus both V_tg and V_bg. The charge density n and displacement field D are
related to the gate voltages by a linear transformation (Methods). The
superlattice densities ±n_s and the half-filling at n_s/2 are indicated by dashed
lines parallel to the D axis. Correlated insulator states are observed at n_s/2
filling in finite displacement fields. CNP, charge neutrality point. g, Map of
low-field Hall coefficient R_H (left) and resistance R_xx (right) near the n_s/2
correlated states for the 1.09° TBBG device (the vertical dashed lines indicate
n_s/2). We find that accompanying the onset of the correlated insulator states at
D/ε_0 ≈ ±0.18 V nm⁻¹, a new sign change of the Hall coefficient also emerges.


perpendicular to the device. The dual-gate device geometry allows
us to independently vary the total charge density n and D (see Methods
for details of the transformation between gate voltages and (n, D)).
Figure 1f shows the resistance map in the top gate voltage–bottom
gate voltage (V_tg–V_bg) space for the θ = 1.09° device. At D = 0, no
insulating behaviour other than the full-filling gaps at ±n_s is observed.
However, when a displacement field D is applied in either direction, an
insulating state appears at n_s/2 for a range of |D|. This new insulating
state induced by the displacement field is further examined by measuring
the Hall coefficient R_H versus n and D, as shown in the left panel
of Fig. 1g (θ = 1.09° device), and comparing with R_xx shown in the right
panel. At the onset of the insulating states at D/ε_0 ≈ ±0.18 V nm⁻¹, where
ε_0 is the vacuum permittivity, R_H develops additional sign changes
adjacent to the insulating states, suggesting the creation of new gaps
by the displacement field. The insulating states disappear when D/ε_0
exceeds ±0.35 V nm⁻¹. In both the θ = 1.09° and θ = 1.23°
devices, we find signatures of the onset of correlated behaviour at
n = −n_s/2 and D = 0, but no well-developed insulating state is observed
(Extended Data Fig. 1, Methods).

In the θ = 1.23° device, we observe a similar but more intricate hierarchy
of tunable insulating states that stem from the interplay of correlations,
the superlattice bands and the magnetic field. Figure 2a shows the
n–D resistance map for the θ = 1.23° TBBG device measured at T = 0.07 K.
Noticeably, as |D| is increased, the insulating state at charge neutrality
n = 0 strengthens in the same way as in Bernal-stacked bilayer
graphene, while the superlattice gaps at ±n_s weaken and eventually
disappear (at |D|/ε_0 > 0.6 V nm⁻¹ for the +n_s insulating state and at
|D|/ε_0 > 0.35 V nm⁻¹ for the −n_s insulating state). The band structures of
TBBG in zero and finite external displacement fields calculated using
Fig. 2 | Displacement-field-tunable correlated insulator states in TBBG.
a, Colour plot of resistance versus charge density n and displacement field D
(θ = 1.23° device, section 1, see Methods). The green dashed line cutting
through the D < 0 correlated state is the line cut along which b is taken (for the
θ = 1.23° device, section 2, see Methods). b, Resistance versus n and T at a fixed
D/ε_0 = −0.38 V nm⁻¹. The correlated insulator states at n_s/4 and n_s/2 are
suppressed by increasing the temperature. c, Resistance at density n_s/2 versus
displacement field and temperature. The resistance shows a maximum at
approximately D/ε_0 = ±0.4 V nm⁻¹, the region where the correlated insulator
state is present. The inset shows the thermal activation gap extracted from
temperature dependence at different values of D across the n_s/2 state.


a continuum approximation are shown in Fig. 2e–g (see Methods for
details). It should be noted that, although TBBG has twice as many
graphene layers as TBG, the band counting is the same, that is,
each band (spin and valley degenerate) accommodates four electrons
per moiré unit cell. At zero displacement field, the calculated gap at
charge neutrality is negligible, while the superlattice gaps above
and below the flat bands are non-zero. When the displacement field is
increased, the charge neutrality gap quickly widens while the superlattice
gaps become smaller and eventually vanish, in agreement with our
experimental observations.

At intermediate displacement fields around D/ε_0 = −0.38 V nm⁻¹, we
observe the insulating states not only at n_s/2 over a wider range of D, but
also at n_s/4 over a smaller range (Fig. 2a). We attribute these states to a
Mott-like mechanism similar to that observed in TBG, which results
from the Coulomb repulsion of the electrons in the flat bands when
each unit cell hosts exactly one or two electrons, corresponding to n_s/4
and n_s/2 fillings, respectively. The n_s/4 state requires a finer tuning of
D to be revealed, possibly due to the smaller gap size. This is evident
from Fig. 2b, where we show the resistance versus n and temperature
T with the displacement field D/ε_0 fixed at −0.38 V nm⁻¹. While the
n_s/2 state persists up to approximately 8 K, the n_s/4 state disappears
at less than 3 K, indicating a smaller gap. Figure 2c shows the resistance
of the n_s/2 state versus the displacement field and temperature.
The ‘optimal’ displacement field to reach the maximal resistance is


d, Normalized resistance curves versus temperature at various densities
between 0 and n_s/2 = 1.77 × 10¹² cm⁻², which are indicated by dashed lines in b.
Away from the charge neutrality point, all resistance curves show
approximately linear R–T behaviour above 10 K, with similar slopes (Extended
Data Fig. 3). e–g, Calculated band structure (left) and density of states (DOS;
right) for θ = 1.23° TBBG at ΔV = 0 (e), ΔV = 6 mV (f) and ΔV = 12 mV (g), where ΔV is
the potential difference between adjacent graphene layers induced by the
external displacement field (assumed to be the same between all layers).
Single-particle bandgaps in the dispersion are highlighted by green (below and
above the flat bands) and purple (at charge neutrality) bars.


approximately ±0.4 V nm⁻¹. As the temperature increases, the peak in
R_xx not only decreases in value but also broadens in D. In the inset, we
show the evolution of the gap versus the displacement field. At temperatures
higher than 10 K and away from the charge neutrality point,
the transport is dominated by a linear R–T behaviour similar to that
observed in TBG (Fig. 2d, see also Extended Data Fig. 3, Methods).

Figure 3 shows the response of the various correlated states to magnetic
fields in the perpendicular or in-plane direction with respect to
the sample plane. Figure 3a–c shows the n–D maps of the resistance
for the θ = 1.23° device at B = 0 T, B_⊥ = 8 T and B_∥ = 8 T, respectively. The
plots focus on densities from n = 0 to n = n_s. Figure 3a shows the band
insulator states at n = 0 and n = n_s, as well as the correlated insulating
states at n_s/2 and n_s/4 (encircled by dashed lines), but not at 3n_s/4
filling at this zero magnetic field. Interestingly, at B_⊥ = 8 T (Fig. 3b),
the correlated insulating states at n_s/4 and n_s/2 vanish at their original
positions centred around D/ε_0 = −0.38 V nm⁻¹, whereas new insulating
states appear at n = n_s/4, D/ε_0 ≈ −0.2 to −0.35 V nm⁻¹, and n = n_s/2,
D/ε_0 ≈ −0.45 to −0.6 V nm⁻¹, above and below their original positions at
B = 0, respectively. A new correlated insulating state also now appears
at 3n_s/4, D/ε_0 = −0.4 to −0.5 V nm⁻¹. However, no such strong shift is
observed with in-plane magnetic field (Fig. 3c). At B_∥ = 8 T, the correlated
insulating states are clearly visible at all integer electron fillings
(n_s/4, n_s/2, 3n_s/4) near D/ε_0 = −0.38 V nm⁻¹. Figure 3d, e shows the evolution
of the n_s/2 insulating state as a function of B_⊥ and B_∥. An abrupt
Fig. 3 | Magnetic field response of the displacement-field-tunable
correlated insulator states in TBBG. a–c, Resistance plot for the θ = 1.23°
TBBG device in magnetic fields of B = 0 (a), B_⊥ = 8 T perpendicular to the sample
(b) and B_∥ = 8 T parallel to the sample (c). All measurements are taken at sample
temperature T = 0.07 K. Various correlated states at integer electron fillings of
the moiré unit cell are indicated by dashed circles. At zero field, only the n_s/4
and n_s/2 states appear around |D|/ε_0 = 0.38 V nm⁻¹ (denoted by blue dashed
lines). In a perpendicular field of 8 T, the n_s/4 state shifts towards lower |D|, the
n_s/2 state shifts towards higher |D| and a 3n_s/4 state also emerges. In a parallel
field of 8 T, however, the position of the states barely shifts but their resistance
increases monotonically. d, e, Resistance at n = n_s/2 versus displacement field


shift in the range of D for which the insulating state appears occurs at
B_⊥ = 5 T, whereas the insulating state strengthens monotonically with
the in-plane magnetic field.

The key difference between the effects of the perpendicular and
in-plane magnetic fields lies in the fact that the lateral dimension of the
unit cell in TBBG, about 10 nm, is much larger than the thickness of the
system, about 1 nm. Therefore, while both fields couple equally to the
spins of the correlated electrons, B_∥ has a much weaker (but non-zero)
effect on the orbital movement of the electrons. To theoretically understand
the behaviour of the correlated insulating states in a magnetic
field, we first have to identify their ground state. Figure 3f, g shows the
evolution of the thermal activation gap of the n_s/2 state in both B_⊥ and
B_∥. We find a g-factor of g_⊥ ≈ 3.5 for the perpendicular direction (up to
5 T before the shift occurs) and a g-factor of g_∥ ≈ 1.5 for the in-plane
direction. g_∥ is close to (but less than) g = 2, which is expected for a
spin-polarized ground state with contribution from only the electron
spins. This difference is theoretically expected because of finite in-plane
orbital effects. Therefore, on the basis of these measurements, we may
conclude that the correlated insulating states have a spin-polarized
nature. These observations establish TBBG as a distinctive system from
the previously reported magic-angle TBG system, which exhibits
and magnetic field applied perpendicular (d) and in-plane (e) with respect to
the device. While the correlated insulator state monotonically strengthens in
B_∥, the perpendicular field induces a phase transition at around B_⊥ = 5 T, where
the correlated state abruptly shifts to higher |D|. f, g, Temperature dependence
of the resistance at the n_s/2 insulator in perpendicular (f) and in-plane (g)
magnetic fields. The insets show the thermal activation gaps Δ extracted from
the Arrhenius fits (R ∝ exp(Δ/(2k_B T)), where k_B is the Boltzmann constant) in the main
figures (solid lines) versus the magnitude of the field in the respective
orientation. Error bars correspond to a confidence level of 0.99. The linear fit
of the thermal activation gap gives a g-factor of about 3.5 for the perpendicular
field (up to 5 T only) and 1.5 for the in-plane field (entire field range).


half-filling insulating states that are shown to be spin unpolarized, as
they are suppressed by an in-plane magnetic field. In B_⊥, however, one
would expect orbital effects to have a more substantial role. We may
attribute the larger g_⊥ of about 3.5 to exchange-induced enhancement
effects, similar to what is observed in Landau levels of gallium arsenide
quantum wells and graphene. In Extended Data Fig. 4, we provide
additional magnetic field response data for the n_s/4 and the 3n_s/4 states.
Both of these states also exhibit a spin-polarized behaviour, as they
become more resistive under the in-plane magnetic field.
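
The two-step extraction discussed above (an activation gap from the temperature dependence, then a g-factor from the field dependence of the gap) can be sketched as follows. This is an illustration under stated assumptions, not the authors' analysis code: it assumes an activated form R ∝ exp(Δ/(2 k_B T)) and a linear Zeeman-like dependence Δ(B) = Δ(0) + g μ_B B, and all data below are synthetic and noise-free.

```python
import math

K_B_MEV_PER_K = 0.08617333262   # Boltzmann constant in meV/K
MU_B_MEV_PER_T = 0.05788381806  # Bohr magneton in meV/T


def linear_fit(xs, ys):
    """Ordinary least-squares straight line; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def activation_gap_mev(temps_k, resistances_ohm):
    """Gap from the slope of ln R versus 1/T, assuming ln R = const + gap / (2 k_B T)."""
    inv_t = [1.0 / t for t in temps_k]
    ln_r = [math.log(r) for r in resistances_ohm]
    slope, _ = linear_fit(inv_t, ln_r)
    return 2.0 * K_B_MEV_PER_K * slope


def g_factor(fields_t, gaps_mev):
    """Effective g-factor from the slope of gap versus field, assuming d(gap)/dB = g * mu_B."""
    slope, _ = linear_fit(fields_t, gaps_mev)
    return slope / MU_B_MEV_PER_T


def synthetic_resistance(gap_mev, temp_k, r0_ohm=1.0e3):
    """Hypothetical activated resistance used only to exercise the fit."""
    return r0_ohm * math.exp(gap_mev / (2.0 * K_B_MEV_PER_K * temp_k))


# Synthetic example: zero-field gap of 0.3 meV and g = 1.5 (values chosen for illustration).
fields = [0.0, 2.0, 4.0, 6.0, 8.0]
temps = [2.0, 3.0, 4.0, 6.0, 8.0]
gaps = []
for b in fields:
    gap_true = 0.3 + 1.5 * MU_B_MEV_PER_T * b
    resistances = [synthetic_resistance(gap_true, t) for t in temps]
    gaps.append(activation_gap_mev(temps, resistances))
fitted_g = g_factor(fields, gaps)
```

With real data the fit range matters: the text restricts the perpendicular-field fit to fields below about 5 T, before the abrupt shift in D occurs.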

In addition to the discussion above, we noticed that all the correlated
insulating states in the θ = 1.23° TBBG device, whether at
zero magnetic field or high magnetic fields, lie within the range
D/ε_0 = −0.6 to −0.2 V nm⁻¹. Coincidentally, this is also the range where
both the gap at charge neutrality (n = 0) and the gap at the superlattice
density (n = n_s) are well developed (that is, the case in Fig. 2f).
On the basis of this observation, we suggest that the displacement
field tunability of the correlated states is tied to the modulation of the
single-particle bandgaps by the displacement field. When either gap
at n = 0 or n = n_s is absent, the thermally excited or disorder-scattered
carriers from the upper or lower bands would suppress the ordering
of the electrons and hence the correlated states. Further theoretical
Fig. 4 | Correlated insulator states in a multi-flat-band system.
a, b, Calculated band structure of θ = 0.84° TBBG without an interlayer
potential (a) and with an interlayer potential ΔV = 18 mV (b). Near charge
neutrality, within a 50-meV window, there are in total six sets of flat bands
spanning densities −3n_s to 3n_s. Upon applying a displacement field, these bands
are further flattened and separated from each other, which makes them more
prone to giving rise to correlated states at each half-filling. c, Resistance map of
a θ = 0.84° TBBG device measured at T = 0.07 K. The top axis is the charge
density normalized to the superlattice density n_s. Besides the D-tunable gaps at
multiples of n_s, we find signatures of correlated states at n/n_s = −1/2, −1/4 for
|D|/ε_0 > 0.4 V nm⁻¹, which are indicated by dashed circles. d, Resistance as a


work is needed to reveal the detailed structure of the displacement 
field dependence of the correlated states. 

We have also investigated the regime of substantially smaller twist
angles. Unlike the case of TBG, further reduction of the twist angle of
TBBG to 0.84° results not in one, but rather three pairs of flat bands,
separated from other bands by bandgaps (Fig. 4a). The application of
an electrical displacement field further flattens these bands and separates
them from each other (Fig. 4b). This would imply that all electrons
within the density range −3n_s to +3n_s might experience strong Coulomb
interactions and that their correlations can get further enhanced
by applying a displacement field. These predictions from the band
theory are consistent with our experimental observations. In Fig. 4c,
where we show the resistance map of the θ = 0.84° TBBG device versus
n and D, we indeed find that the weak signatures of the −n_s/2 and
−n_s/4 correlated insulating states appear only at high displacement
fields |D|/ε_0 > 0.4 V nm⁻¹ (encircled by white dashed lines). The full-filling
gaps at ±n_s and ±2n_s are tunable by the displacement field to different
extents as well.

As we turn on a perpendicular magnetic field, a series of correlated
insulator states appear across the entire density range spanning the
multiple flat bands. Figure 4d, e shows the Landau fan diagrams at
D/ε_0 = 0.6 V nm⁻¹ and D = 0, respectively. At zero displacement field, the
Landau fan shows a complicated Hofstadter butterfly pattern due to


function of charge density and perpendicular magnetic field B when a
displacement field is present, D/ε_0 = 0.6 V nm⁻¹. We find clear correlated states at
n/n_s = −1/2, 1/4 and 1/2, and also evidence at 3/2 and 5/2 fillings, as indicated by
arrows (blue and green arrows indicate half-fillings and quarter-filling,
respectively). e, For comparison, when no displacement field is present, we do
not find any signature of half-filling correlated states. Owing to the formation
of a superlattice, we also observe Hofstadter-butterfly-related features when B
is such that the magnetic flux in each unit cell is equal to Φ_0/2, Φ_0/3, Φ_0/4 and so
on, where Φ_0 = h/e is the flux quantum, h and e being Planck’s constant and
electron charge, respectively.


commensurate flux threading into the unit cell (see also Methods,
Extended Data Fig. 2), but no correlated state is observed at half-fillings
or quarter-fillings. We note that a resistive region appears at n = 1.63n_s
in Fig. 4e, which does not coincide with any commensurate filling
and might be ascribed to twist-angle inhomogeneity in the sample.
In contrast, at D/ε_0 = 0.6 V nm⁻¹, we find clear signatures of correlated
states at n_s/2 and −n_s/2 in the centre flat bands, and weak evidence
at 3n_s/2 and 5n_s/2 in the upper flat bands. All of these half-filling
correlated states appear to be enhanced by the application of a perpendicular
magnetic field, which we attribute to the same spin/orbital
combined enhancement of the correlated gaps as in the n_s/2 state of
the θ = 1.23° device (Fig. 3f). The correlated states at ±n_s/2 appear to
be much stronger than the states at 3n_s/2 and 5n_s/2 in high magnetic
fields, consistent with the fact that from our calculations, the pair of
bands closer to charge neutrality is much flatter than the other two
pairs farther away from charge neutrality, as can be seen in Fig. 4b.
The resistance of the quarter-filling state at n_s/4, however, does not
increase monotonically with the perpendicular field, but rather eventually
gets suppressed at 5 T.
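
The commensurate-flux condition behind the Hofstadter features can be made concrete with a short estimate of the fields at which the flux per moiré unit cell equals Φ_0/q. The sketch assumes the same idealized moiré geometry as a rigid small-angle twist (wavelength a / (2 sin(θ/2)), triangular lattice) and a graphene lattice constant of 0.246 nm, neither of which is given in the text; the function names are hypothetical.

```python
import math

A_GRAPHENE_NM = 0.246                 # assumed graphene lattice constant in nm
PLANCK_J_S = 6.62607015e-34           # Planck's constant h
ELECTRON_CHARGE_C = 1.602176634e-19   # elementary charge e
FLUX_QUANTUM_WB = PLANCK_J_S / ELECTRON_CHARGE_C  # Phi_0 = h / e


def moire_cell_area_m2(theta_deg):
    """Moiré unit-cell area, assuming a triangular lattice with period a / (2 sin(theta / 2))."""
    theta = math.radians(theta_deg)
    lam_nm = A_GRAPHENE_NM / (2.0 * math.sin(theta / 2.0))
    return (math.sqrt(3.0) / 2.0) * lam_nm ** 2 * 1.0e-18  # nm^2 to m^2


def commensurate_fields_tesla(theta_deg, denominators):
    """Fields B_q at which the flux through one moiré cell equals Phi_0 / q."""
    area = moire_cell_area_m2(theta_deg)
    return [FLUX_QUANTUM_WB / (q * area) for q in denominators]


fields_084 = commensurate_fields_tesla(0.84, [1, 2, 3, 4])
```

Under these assumptions one flux quantum per cell in the 0.84° device corresponds to roughly 17 T, so the Φ_0/2, Φ_0/3 and Φ_0/4 conditions fall within a laboratory field range of a few tesla.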

Our results show that TBBG exhibits a rich spectrum of correlated
phases tunable by twist angle, electric displacement field and magnetic
field, enabling further studies of strongly correlated physics and topology
in multi-flat-band systems.
Methods


Fabrication and measurement 

The reported devices were fabricated with two sheets of Bernal-stacked
bilayer graphene and encapsulated by two hexagonal boron nitride
(hBN) flakes. Both bilayer graphene and hBN were exfoliated on
SiO2/Si substrates, and the thickness and quality of the flakes were
confirmed with optical microscopy and atomic force microscopy. A
modified polymer-based dry pick-up technique was used for the fabrication
of the heterostructures. A poly(bisphenol A carbonate) (PC)/
polydimethylsiloxane (PDMS) layer on a glass slide was positioned
in the micro-positioning stage to first pick up an hBN flake at around
100 °C. The van der Waals interaction between the hBN and bilayer
graphene then allowed us to tear the bilayer graphene flake, which
was then rotated at a desired angle and stacked at room temperature.
The resulting hBN/bilayer graphene/bilayer graphene heterostructure
was released on another hBN flake on a palladium/gold back gate that
was pre-heated to 170 °C, using a hot-transfer method. The desired
geometry of the devices was achieved with electron beam lithography
and reactive ion etching. The electrical contacts and top gate were
deposited by thermal evaporation of chromium/gold, making edge
contacts to the encapsulated graphene.

Electronic transport measurements were performed in a dilution
refrigerator with a superconducting magnet, with a base electronic
temperature of 70 mK. The data were obtained with low-frequency
lock-in techniques. We measured the current through the sample
amplified by 10⁷ V A⁻¹ and the four-probe voltage amplified by 1,000,
using SR-830 lock-in amplifiers that were all synchronized to the same
frequency between around 1 and 20 Hz. For resistance measurements,
we typically used a voltage excitation of less than 100 μV or current
excitation of less than 10 nA.


List of measured TBBG devices 

Following the definition given in the main text and accounting for offsets
in the gate voltages due to impurity doping, n and D are related to
the top and bottom gate voltages V_tg and V_bg by

$$ n = [c_{tg}(V_{tg} - V_{tg,0}) + c_{bg}(V_{bg} - V_{bg,0})]/e $$

$$ D = [-c_{tg}(V_{tg} - V_{tg,0}) + c_{bg}(V_{bg} - V_{bg,0})]/2 $$

Extended Data Table 1 lists the twist angles and parameters c_tg (top
gate capacitance per area), c_bg (bottom gate capacitance per area),
V_tg,0 (top gate voltage offset), V_bg,0 (bottom gate voltage offset) and n_s
(superlattice density) for all devices discussed in this work, including
those shown in the Extended Data figures. e is the unit electron charge.
These parameters are estimated so that all diagonal features in
the V_tg–V_bg maps are rotated to be vertical in the corresponding n–D
maps, and so that the features are symmetrical with respect to D = 0
after the transformation.
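
The linear transformation between gate voltages and (n, D) can be written out as a small sketch, together with its inverse, which is what one needs to sweep n at fixed D. The function names and the capacitance and offset values in the example are hypothetical; real values would come from Extended Data Table 1.

```python
ELECTRON_CHARGE_C = 1.602176634e-19  # elementary charge e in C
EPSILON_0 = 8.8541878128e-12         # vacuum permittivity in F/m


def gates_to_n_d(v_tg, v_bg, c_tg, c_bg, v_tg0=0.0, v_bg0=0.0):
    """Return (n, D) from gate voltages.

    Capacitances are per unit area (F m^-2); n is in m^-2 and D in C m^-2.
    """
    q_top = c_tg * (v_tg - v_tg0)
    q_bot = c_bg * (v_bg - v_bg0)
    n = (q_top + q_bot) / ELECTRON_CHARGE_C
    d = (-q_top + q_bot) / 2.0
    return n, d


def n_d_to_gates(n, d, c_tg, c_bg, v_tg0=0.0, v_bg0=0.0):
    """Inverse transformation: gate voltages that realize a requested (n, D)."""
    q_top = n * ELECTRON_CHARGE_C / 2.0 - d
    q_bot = n * ELECTRON_CHARGE_C / 2.0 + d
    return v_tg0 + q_top / c_tg, v_bg0 + q_bot / c_bg


def d_over_eps0_v_per_nm(d):
    """Express D as D / epsilon_0 in V nm^-1, the unit used in the text."""
    return d / EPSILON_0 * 1.0e-9


# Hypothetical device parameters (illustration only).
C_TG, C_BG = 1.0e-3, 2.0e-3    # F m^-2
V_TG0, V_BG0 = 0.1, -0.2       # V
n_example, d_example = gates_to_n_d(1.0, 2.0, C_TG, C_BG, V_TG0, V_BG0)
v_tg_back, v_bg_back = n_d_to_gates(n_example, d_example, C_TG, C_BG, V_TG0, V_BG0)
```

With equal capacitances and equal gate voltages measured from their offsets, the two gate contributions to D cancel and only n changes, which is why features at fixed density appear as diagonals in the V_tg–V_bg maps.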

In Extended Data Fig. 1a–f, we show V_tg–V_bg resistance maps for all six
TBBG devices we measured. Extended Data Fig. 1c, d was measured in
the same TBBG sample, but in different sample regions that are approximately
27 μm apart (sections 1 and 2, respectively). Both regions have
identical parameters (hence the two identical rows in Extended Data
Table 1), with the same twist angle θ = 1.23°, and also nearly identical
transport characteristics. The two sections are electrically disconnected
via etching, but the extracted twist angles from the data have
a difference of less than 0.01°, suggesting very uniform twist angles
across this entire sample.

In almost all TBBG samples, we noticed a peculiar cross-like pattern
around (n, D) = (−n_s/2, 0), that is, near p-side half-filling of the superlattice
band. This is especially apparent in the 1.09° and 1.23° devices,
which are highlighted in Extended Data Fig. 1g, h. The p-side band
does not exhibit a strong D-tunable correlated state as elaborated in
the main text, possibly due to the larger bandwidth compared with its
n-side counterpart. This cross-like pattern might represent an onset
of correlated behaviour near half-filling of the band. Further experimental
work and theoretical insight are needed to understand this
phenomenon.


Sample quality and Landau fans 

To demonstrate the high quality of our fabricated TBBG devices, we
measured the Landau fan diagrams and Hall mobilities of all three
devices discussed in the main text, as shown in Extended Data Fig. 2.
The Hall mobilities are extracted from the ratio between the Hall
coefficient R_H and longitudinal resistance at small magnetic fields
(B < 0.5 T). All three samples exhibit high Hall mobilities close to or
above 100,000 cm² V⁻¹ s⁻¹.
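
The mobility extraction described above amounts to dividing the Hall coefficient by the sheet resistance. The sketch below assumes a single carrier type (so that R_H = 1/(n e)) and a sample with a length-to-width ratio close to one, so that the measured longitudinal resistance approximates the sheet resistance; the numbers are illustrative, not measured values.

```python
ELECTRON_CHARGE_C = 1.602176634e-19  # elementary charge e in C


def hall_coefficient_ohm_per_t(density_cm2):
    """Single-carrier Hall coefficient R_H = 1 / (n e), in ohm per tesla."""
    density_m2 = density_cm2 * 1.0e4  # cm^-2 to m^-2
    return 1.0 / (density_m2 * ELECTRON_CHARGE_C)


def hall_mobility_cm2_per_vs(hall_coeff_ohm_per_t, sheet_resistance_ohm):
    """Hall mobility mu = |R_H| / rho_xx, converted from m^2/(V s) to cm^2/(V s)."""
    return abs(hall_coeff_ohm_per_t) / sheet_resistance_ohm * 1.0e4


# Illustrative numbers: n = 1e12 cm^-2 and a sheet resistance of about 62 ohm.
r_h = hall_coefficient_ohm_per_t(1.0e12)
mobility = hall_mobility_cm2_per_vs(r_h, 62.41509074)
```

At a density of 10¹² cm⁻², a mobility of 100,000 cm² V⁻¹ s⁻¹ therefore corresponds to a sheet resistance of only about 62 Ω.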

All three devices also show clear Landau fans starting from about
1 T. The filling factor of each level is labelled in the lower panels of
each plot. In particular, due to the lower angle of the θ = 0.84° device,
its Landau fan displays a complicated Hofstadter’s butterfly pattern
starting from 3 T.


Linear R-T behaviour 

Extended Data Fig. 3 shows the resistance versus temperature behaviour,
at different densities, observed across several small-angle
TBBG devices. In the 1.23° device, we find approximately linear R–T
behaviour above 10 K for densities ranging from around 0.5 × 10¹² to
2.5 × 10¹² cm⁻², encompassing the n_s/2 correlated state. The resistance
slope in this range of densities does not vary very substantially,
ranging from around 210 to 350 Ω K⁻¹. As all our devices have
length-to-width ratios close to one, these slope values are therefore
close to those reported in TBG. In stark contrast, the resistance
behaviour in the hole-doping side (n < 0), as shown in Extended Data
Fig. 3b, shows qualitatively different behaviour: it does not show linear
R–T characteristics, at least up to 30 K, and the resistance value is
about an order of magnitude smaller than on the electron-doping side.
These data are consistent with the picture that the electron-doping
band is flatter than the hole-doping band, therefore exhibiting more
pronounced correlated phenomena, examples being the n_s/2 insulator
state and the linear R–T behaviour. Extended Data Fig. 3c shows
R–T curves close to the n_s/2 state.

The data for the 1.09° device show a similar trend of linear R–T behaviour
starting around 5–10 K, as shown in Extended Data Fig. 3d.

In the 0.84° device, we find a very different behaviour. There is a
region of sublinear or approximately linear R–T behaviour at all densities,
except at multiples of n_s, but the resistance slope is now strongly
dependent on the charge density n. The slope approximately follows
a power law dR_xx/dT ∝ n^α where α ≈ −1.77 (see inset).
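
A power-law exponent of this kind is typically obtained from a straight-line fit in log-log space. The sketch below shows that procedure on synthetic, noise-free data generated with an exponent of −1.77; the prefactor and density grid are hypothetical, and this is not the authors' fitting code.

```python
import math


def power_law_fit(densities, slopes):
    """Fit slopes = c * densities**alpha by least squares on the logarithms.

    Returns (alpha, c). All inputs must be positive.
    """
    xs = [math.log(n) for n in densities]
    ys = [math.log(s) for s in slopes]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    alpha = sxy / sxx
    prefactor = math.exp(mean_y - alpha * mean_x)
    return alpha, prefactor


# Synthetic data: densities in units of 1e12 cm^-2, slopes in ohm per kelvin.
densities = [0.5, 1.0, 2.0, 4.0]
slopes = [300.0 * n ** -1.77 for n in densities]
alpha_fit, prefactor_fit = power_law_fit(densities, slopes)
```

Densities at multiples of n_s would be excluded before fitting, since the text notes that the R–T behaviour differs there.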


Theoretical methods 

The band structures shown in the main text are calculated using a continuum
model based on the original continuum model for TBG, which
qualitatively captures most of the important features of the bands in
TBBG including displacement-field dependence. To the lowest order,
the continuum model of twisted graphene superlattices is built on the
approximation that the interlayer coupling between the A/B sublattice
of one layer and the A/B sublattice of the other layer has a sinusoidal
variation over the periodicity of the moiré pattern. For the three possible
directions of interlayer connections between the wave vectors in
the Brillouin zone, there are three connection matrices
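
$$
H_1 = w \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}, \quad
H_2 = w \begin{pmatrix} \omega^2 & 1 \\ \omega & \omega^2 \end{pmatrix}, \quad
H_3 = w \begin{pmatrix} \omega & 1 \\ \omega^2 & \omega \end{pmatrix}
$$
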
where w is the interlayer hopping energy and ω = exp(2πi/3). H_{i,αβ}, with
α, β = A, B, represents the hopping from sublattice α in the first layer
to sublattice β in the second layer, with momentum transfer determined
by i (defined in the cited reference). Note that in this gauge choice, the origin
of rotation is chosen where the B sublattice of the first layer coincides
with the A sublattice of the second layer, so that the H_BA component
has zero phase while the other terms acquire phases. A different gauge
choice is equivalent to an interlayer translation, which has been shown
to have a negligible effect in the case of small twist angles.

To extend this formulation to TBBG, we add a simplified bilayer
graphene Hamiltonian

$$H_b = \begin{pmatrix} 0 & 0 \\ w_b & 0 \end{pmatrix}$$

between the non-twisted layers. The momentum transfer is zero since
the bilayers are not twisted and the coupling is constant over the moiré
unit cell. For simplicity, we consider only the ‘dimer’ coupling in the
bilayer, neglecting second-nearest-neighbour hopping terms and
trigonal warping terms. The two bilayers in TBBG (layers 1-2 and
layers 3-4) have the same stacking order, that is, for zero twist angle the
total stacking would be ‘ABAB’ instead of ‘ABBA’. In the calculations
used in the main text, we used parameters w = 0.1 eV and w_b = 0.4 eV,
so that when either parameter is turned off we obtain either the two
non-interacting bilayer graphene (w = 0) or the non-interacting TBG
and two-monolayer graphene (w_b = 0).


Additional magnetic-field-response data 

Extended Data Fig. 4 shows the response of correlated states at n_s/4 and
3n_s/4 in a perpendicular or in-plane magnetic field, similar to Fig. 3d,
e, for the θ = 1.23° device. For the n_s/4 state, we also find a signature
of a phase transition at D/ε_0 = -0.36 V nm^-1, manifesting as a shift of
the D location of the correlated insulator as B_⊥ exceeds 6 T. The 3n_s/4
state shows an overall monotonic increase of resistance and exhibits
no shift in the position in D. In an in-plane field, however, as shown in
Extended Data Fig. 4b, d, both quarter-filling states show a monotonic
enhancement as B_∥ is increased, suggesting that they may have a similar
spin-polarized ground state as the n_s/2 state.


Current-voltage curves and the impact of excitation current on g-factor

In Extended Data Fig. 5, we have plotted the current-voltage (I-V)
curves and differential resistance curves of the θ = 1.23° device when
it is in the correlated insulator states at n_s/4 and n_s/2. In the insulator
states, we find a highly nonlinear region near zero d.c. bias current
(I = 0), where the differential resistance dV_xx/dI is substantially
enhanced. This is in agreement with the existence of a small energy gap,
which is overcome at higher bias voltages/currents. Outside of the
insulator regions (such as shown in Extended Data Fig. 5b), the I-V curves
are mostly linear. For measuring the g-factors at n_s/2, we therefore used
a much smaller excitation current of 0.1 nA to faithfully measure the
differential resistance at I = 0.

We comment here on the effect of the a.c. excitation current on the
measured gap sizes and the g-factor. When sourcing an a.c. bias current
to measure the resistance using a lock-in technique, we effectively
measure a weighted average of the differential resistance near zero
bias. Owing to the highly nonlinear I-V curve at the n_s/2 state, if the
a.c. excitation is large, this average value will be much less than the
peak value. Furthermore, the average value measured in this case can
have a very different temperature dependence compared with the
zero-bias value. For example, although to the best of our knowledge
there is no detailed analysis of the high-bias behaviour in the correlated
insulator state of TBG, TBBG or related systems, if one considers the
high-bias transport to have a contribution from a mechanism similar to
Zener breakdown in semiconductors in an electrical field, the current
is essentially independent of the temperature. There could be other
contributions to the high-bias transport as well, but in general their
temperature dependence would not be identical to that of the zero-bias
peak. In the Arrhenius fit that we use to extract the gap size, the gap
size Δ is set by how fast the resistance rises exponentially with T^-1.
Therefore, a reduction of temperature dependence means that by
averaging the higher-bias differential resistance one would substantially
underestimate the energy gap Δ, and also the g-factor g ~ ∂Δ/∂B.
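
The averaging effect described above can be illustrated numerically. The sketch below assumes a toy differential resistance with a Lorentzian zero-bias peak; the peak height, width and background are hypothetical values chosen for illustration and are not fitted to the device. It computes the first-harmonic lock-in response for a small and a large a.c. current amplitude.

```python
import numpy as np

def differential_resistance(i, r_bg=1.0e3, r_peak=50.0e3, i0=1.0e-9):
    """Toy dV/dI: a Lorentzian zero-bias peak on a constant background (ohm)."""
    return r_bg + r_peak / (1.0 + (i / i0) ** 2)

def voltage(i, r_bg=1.0e3, r_peak=50.0e3, i0=1.0e-9):
    """Analytic integral of the toy dV/dI from 0 to i (volts)."""
    return r_bg * i + r_peak * i0 * np.arctan(i / i0)

def lockin_resistance(i_ac, n_samples=4096):
    """First-harmonic voltage amplitude divided by the a.c. current amplitude."""
    phase = 2.0 * np.pi * (np.arange(n_samples) + 0.5) / n_samples
    v_t = voltage(i_ac * np.sin(phase))
    # Projection onto sin(phase) gives the in-phase first-harmonic amplitude.
    v_first_harmonic = 2.0 * np.mean(v_t * np.sin(phase))
    return v_first_harmonic / i_ac

r_zero_bias = differential_resistance(0.0)
r_small = lockin_resistance(0.1e-9)   # 0.1 nA excitation
r_large = lockin_resistance(5.0e-9)   # 5 nA excitation
print(f"zero-bias dV/dI: {r_zero_bias:.0f} ohm")
print(f"0.1 nA lock-in:  {r_small:.0f} ohm")
print(f"5 nA lock-in:    {r_large:.0f} ohm")
```

In this toy model the small excitation recovers the zero-bias value to within about a per cent, whereas the large excitation reads well under half of it, which is the qualitative behaviour the text describes.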

In Extended Data Fig. 6, we compare the Arrhenius fits of the resistance
at the n_s/2 and n_s/4 states, using a small excitation (0.1 nA) and a larger
excitation (around 5-10 nA). We indeed find that by using an excessive
excitation, both the gap size Δ and the g-factor are substantially
underestimated. In particular, owing to the larger nonlinearity at the
n_s/2 state, its g-factor is underestimated by a factor of about three by
using the larger excitation. Therefore, one should keep these nonlinear
effects in mind when doing temperature-dependent measurements on
such resistive states to obtain accurate results.
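
As a rough illustration of the two-step analysis (an Arrhenius fit at each field, then a linear fit of gap versus field), the sketch below uses synthetic data. It assumes the activated form R ∝ exp(Δ/2k_BT) and defines g through Δ(B) = Δ_0 + gμ_B B; both conventions are assumptions made for this example, since the text does not state the exact prefactors, and the function names are hypothetical.

```python
import numpy as np

K_B = 8.617333e-5    # Boltzmann constant, eV/K
MU_B = 5.788382e-5   # Bohr magneton, eV/T

def arrhenius_gap(temperature, resistance):
    """Gap (eV) from ln R = ln R0 + gap / (2 k_B T); the factor 2 is an assumption."""
    inv_t = 1.0 / np.asarray(temperature, dtype=float)
    slope, _ = np.polyfit(inv_t, np.log(resistance), 1)
    return 2.0 * K_B * slope

def g_factor(fields, gaps):
    """Effective g from the linear slope of gap versus field, in units of mu_B."""
    slope, _ = np.polyfit(fields, gaps, 1)
    return slope / MU_B

# Synthetic data: a 0.3 meV zero-field gap growing with g = 2 (illustrative only).
temps = np.linspace(2.0, 10.0, 9)                  # K
fields = np.array([0.0, 2.0, 4.0, 6.0])           # T
true_gaps = 0.3e-3 + 2.0 * MU_B * fields          # eV
fitted_gaps = np.array([
    arrhenius_gap(temps, 1.0e3 * np.exp(gap / (2.0 * K_B * temps)))
    for gap in true_gaps
])
g_fit = g_factor(fields, fitted_gaps)
print(f"g = {g_fit:.2f}")
```

If the resistance fed into `arrhenius_gap` has a weakened temperature dependence, as happens with an excessive excitation, both the fitted gaps and the resulting g come out too small.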


Data availability 


The data that support the findings of this study are available from the 
corresponding authors upon reasonable request.

Extended Data Fig. 1 | V_tg-V_bg resistance maps of measured TBBG devices.
a-f, Resistance versus V_tg and V_bg for the six TBBG devices measured, which
correspond to the six rows shown in Extended Data Table 1, respectively.
g, h, Cross-like feature near -n_s/2 in TBBG samples with twist angles
θ = 1.23° (g) and θ = 1.09° (h), which might signal the onset of a
correlated state.

Extended Data Fig. 2 | Landau fan diagrams and Hall mobilities of the TBBG
devices. a, Resistance of the 1.09° sample versus carrier density and
perpendicular magnetic field. b, Hall mobility μ_Hall (left axis) and Hall
coefficient R_H (right axis) in the 1.09° sample at different carrier densities.
c-f, Same measurements as in a, b but for the 0.84° (c, d) and 1.23° (e, f)
samples, respectively. All measurements are taken at T < 100 mK. The data for
the 1.09° device are taken at D/ε_0 = 0.2 V nm^-1 while the data for the other
two devices are taken at D = 0.

Extended Data Fig. 3 | Linear resistance versus temperature behaviour in
TBBG. a, b, Resistance versus temperature curves at different charge densities
in the 1.23° sample for the electron-doping side (a) and the hole-doping side (b).
The inset in a shows the slope dR_xx/dT of the linear R-T behaviour as a function
of n for T > 10 K. c, Selected R-T curves near n_s/2 from a. d, Similar linear
R-T behaviour in the 1.09° device. The inset shows the slope dR_xx/dT.
e, Density-dependent sublinear/linear R-T behaviour in the 0.84° device. The
inset shows the slope dR_xx/dT versus n in log-log scale. The slope is
proportional to n to the power of -1.77.

Extended Data Fig. 4 | Additional magnetic field response of TBBG devices.
a-d, Response of the n_s/4 (a, b) and 3n_s/4 (c, d) states in perpendicular
magnetic field (a, c) and in-plane magnetic field (b, d) for the θ = 1.23° device.

Extended Data Fig. 5 | I-V curves in the 1.23° TBBG device at different
carrier densities. D/ε_0 = -0.38 V nm^-1. a-c, The densities correspond
approximately to the n_s/4 (a) and n_s/2 (c) insulating states while the density
for b lies between them. The left axis is the longitudinal voltage V_xx and the
right axis is the differential resistance dV_xx/dI.

Extended Data Fig. 6 | Comparison of the gap sizes and the g-factor using
small and large excitations. a, b, The Arrhenius fits of the resistance at the
n_s/2 state of the 1.23° TBBG device in an in-plane magnetic field. c, d, The same
fits for the n_s/4 state. a and c are measured using a current excitation of 0.1 nA,
while b and d are measured using a voltage excitation of around 100 μV, which
induces a current of around 5-10 nA in the sample. The insets in each panel
show the corresponding g-factor fittings. In general, by using an excessive
excitation, both the energy gaps and the g-factor will be underestimated.

Extended Data Table 1 | List of TBBG devices discussed in the main text and Extended Data figures

θ (°)  | C_tg (F m^-2) | C_bg (F m^-2) | V_tg0 (V) | V_bg0 (V) | n_s (cm^-2)
1.09   | 6.63×10^-4    | 5.02×10^-4    | 0.30      | 0.58      | 2.75×10^12
1.23   | 1.06×10^-3    | 7.14×10^-4    | 0.41      | -0.04     | 3.55×10^12
1.23   | 1.06×10^-3    | 7.14×10^-4    | 0.41      | -0.04     | 3.55×10^12
0.84   | 6.87×10^-4    | 6.38×10^-4    | 0.06      | 0.08      | 1.65×10^12
0.79   | 1.06×10^-3    | 3.57×10^-4    | 0.18      | 0.67      | 1.45×10^12
1.09*  | 1.03×10^-3    | 5.12×10^-4    | 0.28      | 0.45      | 2.75×10^12

The last device is marked with an asterisk to differentiate it from the first device, which happens to have the same twist angle, but it is a totally independent device fabricated on a separate chip.

Reducing the energy bandwidth of electrons in a lattice below the long-range
Coulomb interaction energy promotes correlation effects. Moiré superlattices,
which are created by stacking van der Waals heterostructures with a controlled
twist angle! 3, enable the engineering of electron band structure. Exotic
quantum phases can emerge in an engineered moiré flat band. The recent discovery
of correlated insulator states, superconductivity and the quantum anomalous Hall
effect in the flat band of magic-angle twisted bilayer graphene’ ® has sparked
the exploration of correlated electron states in other moiré systems” “. The
electronic properties of van der Waals moiré superlattices can further be tuned
by adjusting the interlayer coupling’ or the band structure of constituent
layers’. Here, using van der Waals heterostructures of twisted double bilayer
graphene (TDBG), we demonstrate a flat electron band that is tunable by
perpendicular electric fields in a range of twist angles. Similarly to
magic-angle twisted bilayer graphene, TDBG shows energy gaps at the half- and
quarter-filled flat bands, indicating the emergence of correlated insulator
states. We find that the gaps of these insulator states increase with in-plane
magnetic field, suggesting a ferromagnetic order. On doping the half-filled
insulator, a sudden drop in resistivity is observed with decreasing temperature.
This critical behaviour is confined to a small area in the
density-electric-field plane, and is attributed to a phase transition from a
normal metal to a spin-polarized correlated state. The discovery of
spin-polarized correlated states in electric-field-tunable TDBG provides a new
route to engineering interaction-driven quantum phases.


Moiré superlattices of two-dimensional (2D) van der Waals (vdW) materials
provide a new scheme for creating correlated electronic states. By
controlling the twist angle θ between atomically thin vdW layers, the
size of the moiré unit cell can be tuned??. In particular, in twisted bilayer
graphene (TBG), the weak interlayer coupling can open up energy gaps
at the boundary of the mini-Brillouin zone, which modifies the energy
bands of the coupled system. Theoretically, it has been predicted that
around θ = 1.1° (the so-called magic angle, MA), the interlayer
hybridization induces isolated flat bands with drastically reduced bandwidth
and enhanced density of states”. The combination of flat bands and moiré
periodic potential fosters an environment where strongly correlated
states can emerge. Recent experiments performed in MA-TBG indeed
confirmed the appearance of correlated insulator states associated
with the flat bands*. Intriguingly, on doping the half-filled insulator,
superconductivity was discovered. The phase diagram of MA-TBG thus
phenomenologically resembles that of high-temperature superconductors,
whose undoped parent compounds are Mott insulators”.
As a result, there is hope that MA-TBG could be a gateway to understanding
the long-lasting puzzle of high-temperature superconductivity.
Yet, in recent studies, the connection between superconductivity and
the correlated insulator state has been debated“.

One method to study the MA-TBG system is to tune the band structure
through the flat-band condition and observe how the correlated physics
changes. So far, such experimental control has largely been achieved
by fabricating samples with different twist angles. However, different
samples, owing to differences in uncontrollable factors such as the
alignment with hexagonal boron nitride (hBN), strain and dielectric
thickness, often yield contradicting results regarding where the
correlated insulator and superconductivity appear. Only limited tunability
has been demonstrated in TBG by the application of hydrostatic
pressure’. In ABC trilayer graphene/hBN superlattices, the electric field
has been shown to modulate the correlated insulator gap’, opening
up the possibility of continuous tuning of the moiré flat band with
electric field. However, the difficulty in identifying and preserving the
unstable ABC trilayer graphene, together with the precise alignment
required between the graphene and hBN layers, makes it a less
accessible platform. Here we demonstrate a wide range of electric field
tunability in the moiré flat band of twisted double bilayer graphene (TDBG),
consisting of two Bernal-stacked bilayer graphene sheets misaligned
with a twist angle θ (Fig. 1a).

Fig. 1 | Band structure and insulating states in the θ = 1.33° sample.
a, Schematic of TDBG with a twist angle θ. b, Calculated band structure for
θ = 1.33° TDBG at an optimal displacement field. k_x and k_y are wave vectors in
the x and y direction. c, Brillouin zone and band structure of the two individual
bilayer graphene layers under a perpendicular displacement field. The dashed
hexagons represent the mini-Brillouin zone of the moiré superlattice. K_1 and
K_1′ (K_2 and K_2′) are the two valleys of the top (bottom) bilayer graphene.
d, Device

In twisted systems, the twist angle for achieving a flat band is
determined by the band structure of the individual layer and the interlayer
coupling strength. Unlike monolayer graphene, the band structure of
Bernal-stacked bilayer graphene can be tuned by a perpendicular
displacement field D (ref. 7°). As |D| increases, the parabolic band touching
at charge neutrality of bilayer graphene opens up a gap and the bottom
(top) of the conduction (valence) band lifts up (down) into a shallow
Mexican-hat-shaped energy dispersion distorted by trigonal warping”.
The gap in bilayer graphene can be as large as 200 meV for large |D|
before the gate dielectric breaks down”. In TDBG, where two bilayers
are stacked, the displacement field affects the energy dispersion of
each constituent bilayer graphene, allowing a new experimental ‘knob’
to tune the flat-band condition (Fig. 1c). Figure 1b shows moiré band
structures calculated at finite D using the single-particle continuum
model approximation” *”. We find that a well-isolated narrow conduction
band can appear for a range of twist angles θ, where the interband
energy gaps and bandwidth can be controlled by the displacement field
(see Methods for details).

schematic with graphite top and bottom gates. The arrows represent
displacement fields generated by the top gate. e, Optical microscope image of
the 1.33° device. f, Resistivity as a function of top and bottom gate voltages.
CNP, full-filled gaps (±n_s) and half-filled gaps (n_s/2) are marked. The
displacement field where the CNP starts to open up a gap is labelled with D_1 and
D_2^± labels where the gap at the full electron (hole) band filling closes.

We fabricated TDBG devices by tearing and stacking Bernal-stacked
bilayer graphene”®”’. We measured in total seven devices with twist
angle θ = 1.26, 1.32, 1.33, 1.41, 1.48, 1.53 and 2.00°, with the first six
devices showing signatures of correlation effects. All of the devices
measured are encapsulated by hBN. Top gates are made from graphite
or metal, and bottom gates are made from graphite or silicon (details
for each device structure are shown in Extended Data Fig. 9). We focus
our study on the two representative devices θ = 1.33° and θ = 1.26°, and
summarize the behaviour of the other devices in Methods and Extended
Data Table 1. The top and bottom gates with voltages V_TG and V_BG are
used to control the density of electrons, n, and displacement field,
D, independently: n = (C_TG V_TG + C_BG V_BG)/e and
D = (C_TG V_TG - C_BG V_BG)/2, where C_TG (C_BG) is the capacitance between
the TDBG and the top (bottom) gate and e is the elementary charge.
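
The two gate relations above form a linear map that can be inverted to find the gate voltages for a desired (n, D) point. The sketch below implements both directions in SI units; the function names and the capacitance values in the usage lines are hypothetical and are not taken from the devices.

```python
E_CHARGE = 1.602176634e-19  # elementary charge, C

def gates_to_n_and_d(v_tg, v_bg, c_tg, c_bg):
    """Return (n in m^-2, D in C m^-2) from gate voltages (V) and capacitances (F m^-2)."""
    n = (c_tg * v_tg + c_bg * v_bg) / E_CHARGE
    d = (c_tg * v_tg - c_bg * v_bg) / 2.0
    return n, d

def n_and_d_to_gates(n, d, c_tg, c_bg):
    """Invert the relations above: gate voltages needed for a target (n, D)."""
    v_tg = (E_CHARGE * n + 2.0 * d) / (2.0 * c_tg)
    v_bg = (E_CHARGE * n - 2.0 * d) / (2.0 * c_bg)
    return v_tg, v_bg

# Hypothetical capacitances per unit area, for illustration only.
c_top, c_bottom = 1.0e-3, 7.0e-4
n_demo, d_demo = gates_to_n_and_d(2.0, -1.0, c_top, c_bottom)
v_tg_back, v_bg_back = n_and_d_to_gates(n_demo, d_demo, c_top, c_bottom)
print(n_demo * 1e-4, "cm^-2;", d_demo, "C m^-2")
```

Sweeping V_TG and V_BG together along lines of constant D is how a linecut at fixed displacement field would be taken in practice.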

Fig. 2 | Spin polarization of the correlated insulator states. a, Temperature
dependence of conductivity (σ) as a function of carrier density at a constant
displacement field that passes through the half-filled insulator
(D = (D_1 + D_2^+)/2). Inset: Arrhenius plot for the full-filled insulating
state (n_s) and the half-filled insulating state (n_s/2) under different
in-plane magnetic fields. b, Effective mass measured by temperature-dependent
quantum oscillations corresponding to a. c, Development of the correlated
insulating states with in-plane magnetic fields. Top and bottom panels compare
resistivity as a function of n and D under in-plane magnetic fields of
B_∥ = 0 T (bottom) and B_∥ = 13 T (top). The middle panel shows the continuous
evolution of the correlated states by taking a linecut along the dashed lines
in the top and

Figure 1f shows the four-probe resistivity ρ measured in the TDBG
with θ = 1.33° as a function of V_TG and V_BG at temperature T = 1.6 K. CNP
represents the charge neutral point of the TDBG and n_s denotes the
full filling of the flat band, corresponding to four electrons per moiré
unit cell, originating from the spin and valley degeneracy. For a linecut
along a constant displacement field D = (D_1 + D_2^+)/2 (the positions of D_1
and D_2^+ are labelled in Fig. 1f), ρ shows several insulating states where
the corresponding conductance σ = ρ^-1 vanishes as the temperature T
decreases (Fig. 2a), suggesting a gap opening at the Fermi level of the
system. Some insulating regions identified in Fig. 1f can be well
explained by the single-particle band structure presented in Fig. 1b.
For example, we find that the CNP is gapless at D = 0 but develops a gap
for |D| > D_1 ≠ 0. Similarly, at full moiré band filling n = ±n_s, energy gaps
Δ_{n_s} are present within displacement field ranges |D| < D_2^±. Consequently,
for D_1 < |D| < D_2^+ (D_1 < |D| < D_2^-), there is an isolated conduction
(valence) band. Note that D_2^± is different in the conduction band (+) and
valence band (-), owing to the lack of electron-hole symmetry in TDBG. All
these single-particle bandgaps are nicely captured by our calculation
based on a continuum model (Extended Data Fig. 1). The calculation
also captures the cross-like feature in Fig. 1f, which matches with the
van Hove singularities of the bands (details in Methods). Lastly, the
calculated band structure (Fig. 1b) indeed demonstrates the existence
of an isolated flat band at θ = 1.33° under finite displacement field with
a bandwidth around 10-15 meV.
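
The full-filling density n_s follows from the moiré unit-cell area once four electrons per cell are assumed, as stated above. The sketch below uses the standard small-angle moiré geometry with the graphene lattice constant 0.246 nm; the function name is hypothetical and the calculation is a geometric estimate, not the calibration used for the devices.

```python
import math

A_GRAPHENE_NM = 0.246  # graphene lattice constant, nm

def moire_full_filling_density(theta_deg, electrons_per_cell=4):
    """Full-filling density n_s in cm^-2 for a twist angle given in degrees."""
    theta = math.radians(theta_deg)
    moire_length = A_GRAPHENE_NM / (2.0 * math.sin(theta / 2.0))   # nm
    cell_area = (math.sqrt(3.0) / 2.0) * moire_length ** 2         # nm^2
    return electrons_per_cell / cell_area * 1.0e14                  # nm^-2 -> cm^-2

n_s_133 = moire_full_filling_density(1.33)
print(f"n_s(1.33 deg) = {n_s_133:.2e} cm^-2")
```

Inverting the same relation is the usual way a twist angle is assigned to a device from the gate-voltage positions of the full-filling gaps.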

bottom panels. All three panels are measured at a temperature of T = 1.5 K.
d, Half-filled insulating gap Δ_{n_s/2}, quarter-filled insulating gap
Δ_{n_s/4} and full-filled gap Δ_{n_s} as a function of in-plane magnetic field.
The black dashed line indicates Zeeman energy with g = 2. Δ_{n_s/2} of both
devices increases with in-plane magnetic field, indicating spin polarization of
the half-filled insulator. We also note that the single-particle gap Δ_{n_s}
between c_1 and c_2 (purple curve) decreases linearly with Zeeman energy with a
g-factor of 2 (purple dashed line). Insets: schematic of the half-filled
insulating state at zero (left) and large (right) in-plane fields. The x axis is
density of states (DOS) and the y axis is energy (E_F is the Fermi energy). Grey
represents the inert valence band while orange (red) represents the lower
(upper) half of the first conduction band.

In this single-particle band structure, we expect a narrow but uninterrupted
spectrum within the lowest moiré conduction band (c_1), separated
by bandgaps from both the valence band (v_1) and higher
conduction band (c_2) for D_1 < |D| < D_2^+. However, we observe the
development of a well-defined insulating behaviour at half-filling n = n_s/2
(Figs. 1f, 2a). The onset displacement field of this insulating state
coincides with D_1. However, it ends well before D reaches D_2^+, suggesting
both the isolation and the flatness of the band are required for creating
the observed correlated gap (Fig. 1f). Along the same linecut shown in
Fig. 2a (D = (D_1 + D_2^+)/2), we measure the effective cyclotron mass m*
from the temperature-dependent magnetoresistance oscillations
(Extended Data Fig. 8). Figure 2b shows that m* = 0.2m_e for the first
valence band (v_1) and m* = 0.3m_e for the first conduction band (c_1),
where m_e is the bare electron mass. Considering the effective mass of
Bernal-stacked bilayer graphene is about 0.04m_e (ref. °°), the
experimentally observed large m* indicates an order of magnitude narrower
bandwidth than that of bilayer graphene bands folded in the moiré
superlattice Brillouin zone, especially for the c_1 band. We then use the
conduction band effective mass m* = 0.3m_e to estimate the bandwidth
of the c_1 band to be about 10 meV. This bandwidth matches with the
continuum model calculation of TDBG”, confirming the existence of
the flat band experimentally. The absence of correlated insulating
behaviour in the hole-doped regime under similar experimental
conditions can be explained by the larger bandwidth of the moiré
valence band v_1 than that of c_1 (Methods).

Fig. 3 | Critical transition behaviour in the θ = 1.26° sample. a, Resistivity
map around the half-filled insulator. Dashed circle marks the halo.
b, Resistivity as a function of temperature at different spots marked by the
coloured symbols in a using the corresponding colours and shapes. c, I-V curves
at the black circle

We measure the size of the insulating gaps from the activated behaviour
of ρ (Fig. 2a, inset). For θ = 1.33° TDBG, the half-filled insulator is
robust with an energy gap of Δ_{n_s/2} = 3 meV and persists up to a
perpendicular magnetic field B_⊥ ≈ 7 T (Extended Data Fig. 7). As the c_1 band
is spin and valley degenerate in a single-particle picture, the half-filled
insulator is probably polarized in the fourfold spin-valley space. The
in-plane magnetic field B_∥ can be used to probe the spin structure of
the state without substantially coupling to the valley degrees of freedom
in the regime where in-plane orbital effect is negligible. In MA-TBG,
it has been shown that B_∥ reduces Δ_{n_s/2}. Figure 2a inset and Fig. 2c show
the change of ρ as a function of B_∥ in our TDBG sample. We find that the
half-filled insulator becomes more insulating as B_∥ increases (Fig. 2a,
inset) and the displacement field range spanned by the half-filled
insulator expands (Fig. 2c). More quantitatively, we find that the growth of
Δ_{n_s/2} roughly follows the Zeeman energy scale gμ_B B_∥, where μ_B is the
Bohr magneton and the effective g-factor g = 2 (dashed black line in
Fig. 2d). This observation is consistent with a picture where the
occupied states (half of the states in c_1) are spin polarized along the
direction of the external magnetic field. The unoccupied excited states then
carry the opposite spin, separated by a ferromagnetic gap due to
spontaneous symmetry breaking at half-filling. For spin-1/2, the Zeeman
term lowers the energy of the filled states by ΔE = -gμ_B B/2, while raising
the energy of the empty states with opposite spins by ΔE = +gμ_B B/2,
pushing the two bands further apart and enhancing the gap (as
illustrated by Fig. 2d, insets). Calculations from the Hartree-Fock
approximation also support the existence of a spin-polarized correlated
insulating state at half-filling in TDBG?>*.
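
To put numbers on the Zeeman argument, the sketch below evaluates gμ_B B for g = 2 and the corresponding gap of a spin-polarized insulator whose filled and empty bands shift by -gμ_B B/2 and +gμ_B B/2. The function names are hypothetical, and the linear gap model is simply the picture described above rather than a fit to the data.

```python
MU_B_MEV_PER_T = 5.7883818e-2  # Bohr magneton, meV/T

def zeeman_energy_mev(b_tesla, g=2.0):
    """Zeeman energy scale g * mu_B * B in meV."""
    return g * MU_B_MEV_PER_T * b_tesla

def spin_polarized_gap_mev(gap0_mev, b_tesla, g=2.0):
    """Gap when the filled band shifts by -g mu_B B / 2 and the empty band by +g mu_B B / 2."""
    lower_shift = -zeeman_energy_mev(b_tesla, g) / 2.0
    upper_shift = +zeeman_energy_mev(b_tesla, g) / 2.0
    return gap0_mev + (upper_shift - lower_shift)

for field in (0.0, 4.0, 13.0):
    print(field, "T:", round(zeeman_energy_mev(field), 3), "meV")
```

At the largest in-plane field quoted in Fig. 2c (13 T) the Zeeman scale is about 1.5 meV, which is comparable to the zero-field half-filling gaps reported for the two devices.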

In the θ = 1.33° device, applying B_∥ also induces additional correlated
insulating states at quarter-filling (n = n_s/4) and three-quarter-filling
(n = 3n_s/4) (Fig. 2c). The quarter-filled insulating gap opens at B_∥ = 4 T
and increases as B_∥ increases (Fig. 2d). According to the hierarchy of
the symmetry-broken states within mean-field theory”, the
quarter-filled gaps separate the ground state and the excited state of
the same spin and opposite valleys, and thus should be relatively
insensitive to in-plane magnetic fields. However, the enhancement of
quarter-filled gaps with B_∥ and the positions where quarter-filled
insulating states appear in the n-D plane (Extended Data Fig. 2) suggest
that these gaps probably separate states of opposite spin, hinting that
the origin of these strongly correlated states goes beyond a simple
mean-field approach.

In the θ = 1.26° sample, a similar spin-polarized half-filled insulating
state is observed (Fig. 3a), with a much smaller correlated gap
Δ_{n_s/2} = 0.3 meV (red line in Fig. 2d). On doping the half-filled insulator,
we identify the appearance of a ‘halo’ (marked with a dashed circle in
Fig. 3a) surrounding the half-filled insulating state in the V_TG-V_BG plane.
On the halo, the resistivity is slightly higher than that of the nearby
region. Such a halo-like feature commonly appears around the
correlated insulating states in different samples with varying twist angles
(Fig. 4a-d, Extended Data Fig. 9). The half-filled insulating state divides
the halo-like region into two. For the samples with a strong half-filled
insulating gap, Hall measurements performed at low magnetic fields
(Extended Data Fig. 2) show a sign change of the Hall signal across the
boundary of the halo and also across the correlated insulator. A similar
observation has been noted in a recent related study”. The sign of the
Hall signal inside the halo complies with the carrier concentration
counted from half-filling. This suggests the metallic state in the halo
is obtained by adding carriers to the spin-polarized band at half-filling
while retaining the spin-splitting of the band (Fig. 2d, inset), and
therefore is probably a ferromagnetic metal. As the Hall signal outside the
halo matches the expectation for a moiré band without correlation,
the halo marks the border between the spin-polarized and the
spin-unpolarized metallic states (Extended Data Fig. 2c).

in a. Top left inset: I-V on a logarithmic scale, demonstrating the BKT-like
power-law behaviour (see Extended Data Fig. 4 for more details). Bottom right
inset: dV/dI as a function of bias current, which shows a critical current of
about 300 nA. The curves cover temperature from 1.5 K (blue) to 8.5 K (red).

Studying the temperature dependence of the resistivity, ρ(T), inside
the halo, we identify a critical transition with a sudden drop in resistivity
as the temperature decreases. Figure 3b shows the resistivity
measured at different gate configurations marked by matching symbols in
Fig. 3a. We note that the critical transition behaviour, namely the
sudden drop in resistivity, occurs only inside the halo. In contrast,
resistivity outside the halo increases linearly with temperature. The
resistivity behaviour outside the halo is most likely due to ballistic
transport at low temperatures and enhanced phonon scattering at elevated
temperatures. The critical transition behaviour of ρ(T) inside the halo,
however, appears non-trivial. The ρ(T) curve of the 1.26° device (black curve
in Fig. 3b) in particular strongly resembles that of a superconductor, with
near-zero resistivity below 3.5 K. The current-voltage (I-V) curve also
shows superconducting-like nonlinear behaviours: dV/dI vanishes for
bias current smaller than the critical current, I < I_c, and increases to a
near-constant value that is close to the normal resistivity above the
critical transition for I > I_c (Fig. 3c, bottom right inset). This nonlinear
I-V characteristic is distinct from that of a heating effect (see Methods
for a more detailed analysis) and seemingly follows that of the
Berezinskii-Kosterlitz-Thouless (BKT) transition (Extended Data Fig. 4e).
While ρ(T) and the I-V characteristic discussed above for the 1.26°
device are suggestive of superconductivity, we note that several
factors require careful consideration. First, we have not observed direct
evidence of superconducting phase coherence, such as the Fraunhofer
pattern under magnetic fields. Second, ρ(T < T_c) = 0 has been
observed only for the 1.26° device. Figure 4a-d shows four other devices
we measured with the twist angle ranging between 1.32° and 1.48°. In
these devices, similar to the 1.26° device, critical transition behaviours
in the ρ(T) curves are commonly observed inside the halo region that
surrounds the half-filled insulator. These critical behaviours are best
illustrated by the clear peaks in dρ/dT, which are absent outside the
halo (Fig. 4e, f). The critical temperatures, defined as the temperature
where dρ/dT is maximum, are similar across all devices (T_c = 6-9 K),
despite their very different half-filled insulating gap sizes (Extended
Data Table 1). However, the low-temperature resistivity ρ(T < T_c) does
not reach zero, unlike in the 1.26° device (Fig. 4e, Extended Data Table 1).
Strong nonlinear I-V at low temperatures is also absent in these devices.
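
The definition of the critical temperature as the maximum of dρ/dT can be applied to a measured curve with a numerical derivative. The sketch below does this on a synthetic ρ(T) curve, a smooth step at 7 K on a linear background; the curve and the function name are hypothetical and serve only to illustrate the definition.

```python
import numpy as np

def critical_temperature(temperature, resistivity):
    """Return (T at which d(rho)/dT is largest, the derivative array)."""
    temperature = np.asarray(temperature, dtype=float)
    d_rho_dt = np.gradient(np.asarray(resistivity, dtype=float), temperature)
    return temperature[np.argmax(d_rho_dt)], d_rho_dt

# Synthetic curve: resistivity drop centred at 7 K plus a linear background.
temps = np.linspace(1.5, 15.0, 271)   # K, 0.05 K steps
rho = 100.0 + 2.0 * temps + 400.0 / (1.0 + np.exp(-(temps - 7.0) / 0.5))
t_c, d_rho_dt = critical_temperature(temps, rho)
print(f"T_c = {t_c:.2f} K")
```

For noisy data, smoothing ρ(T) before differentiating would usually be needed, since the derivative amplifies point-to-point scatter.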
On the basis of these experimental findings, we propose a few
scenarios to explain the observed critical transition behaviour. One
possibility is that the critical transition is a result of Cooper-pair formation,
but superconductivity is developed only in the 1.26° device. In other
devices, the establishment of phase coherence may be inhibited by an
inhomogeneous distribution of strains or disorder. Alternatively, the
critical transition may correspond to a ferromagnetic transition of the
doped half-filled insulating states. Here we note that the critical
transition behaviours occur only inside the halo region, which we associate
with ferromagnetic metallic states. As the temperature increases, the
ferromagnetic metal turns into a normal metal when the correlation
effect vanishes. Below the critical temperature, carrier scattering
processes related to spin-flip can potentially be suppressed by the
ferromagnetic order, resulting in a reduced resistivity. These two scenarios
do not necessarily compete with each other, leaving open the possibility
of a ferromagnetic superconductor (Methods, Extended Data Fig. 5).
The highly tunable electronic structures of TDBG demonstrated here
and in related studies ** may provide a new route to engineer
correlated phenomena in a moiré superlattice.

Fig. 4 | Gate-tunable flat band and critical behaviour in a series of twist
angles. a-d, Resistivity as a function of filling fraction and displacement field
around the half-filled insulators measured in four different samples with twist
angles of 1.32° (a), 1.33° (b), 1.41° (c) and 1.48° (d). e, f, Resistivity as a
function of temperature inside (e) and outside (f) the halo regions surrounding the

Methods


Band structure of TDBG 

The band structure of TDBG with Bernal-stacked bilayers was obtained 
as follows. In TDBG, each bilayer graphene has a tight-binding Bloch 
Hamiltonian ata momentum k given by 


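$$
H(\mathbf{k}) =
\begin{pmatrix}
U_1 + \Delta & -\gamma_0 f(\mathbf{k}) & \gamma_4 f^*(\mathbf{k}) & \gamma_1 \\
-\gamma_0 f^*(\mathbf{k}) & U_1 & \gamma_3 f(\mathbf{k}) & \gamma_4 f^*(\mathbf{k}) \\
\gamma_4 f(\mathbf{k}) & \gamma_3 f^*(\mathbf{k}) & U_2 & -\gamma_0 f(\mathbf{k}) \\
\gamma_1 & \gamma_4 f(\mathbf{k}) & -\gamma_0 f^*(\mathbf{k}) & U_2 + \Delta
\end{pmatrix}
\qquad (1)
$$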


which is labelled in the order of A1, B1, A2 and B2 sites of the top (1) and the
bottom (2) Bernal-stacked bilayer graphene. In the function f(k) = Σ_l exp(i k·δ_l),
i is the imaginary unit, the index l runs from one to three, δ₁ = a(0, 1),
δ₂ = a(√3/2, −1/2) and δ₃ = a(−√3/2, −1/2), with a = 1.42 Å. f*(k) is the
complex conjugate of f(k). In particular, the electrostatic energy dif- 
ference U between the top and bottom layers is an important tuning 
parameter controlled by D. With this Hamiltonian, one can follow 
the continuum model approach in ref. * to calculate the moiré band 
structure. 
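
As an illustration of the single-bilayer building block, the following minimal sketch assembles a 4 × 4 Bloch Hamiltonian in the (A1, B1, A2, B2) basis. The matrix layout is our reading of equation (1) (the standard Bernal-bilayer form) and should be treated as an assumption; the function names, the layer potentials u1 and u2 used in the example, and the choice of plain Python lists are hypothetical. It does not implement the moiré continuum model itself.

```python
import cmath
import math

A_CC = 1.42  # carbon-carbon distance a in angstrom (value quoted in the text)

# Nearest-neighbour vectors delta_1..delta_3 as defined in the text
DELTAS = [
    (0.0, A_CC),
    (A_CC * math.sqrt(3) / 2, -A_CC / 2),
    (-A_CC * math.sqrt(3) / 2, -A_CC / 2),
]


def f(kx, ky):
    """Structure factor f(k) = sum_l exp(i k . delta_l); k in inverse angstrom."""
    return sum(cmath.exp(1j * (kx * dx + ky * dy)) for dx, dy in DELTAS)


def bilayer_hamiltonian(kx, ky, u1, u2, g0, g1, g3, g4, delta):
    """4x4 Bloch Hamiltonian in the basis (A1, B1, A2, B2); energies in meV.

    Assumed layout: g1 couples A1 and B2, and delta sits on those two sites.
    """
    fk = f(kx, ky)
    fc = fk.conjugate()
    return [
        [u1 + delta, -g0 * fk, g4 * fc, g1],
        [-g0 * fc, u1, g3 * fk, g4 * fc],
        [g4 * fk, g3 * fc, u2, -g0 * fk],
        [g1, g4 * fk, -g0 * fc, u2 + delta],
    ]


def is_hermitian(h, tol=1e-9):
    """Check h[i][j] == conj(h[j][i]) for all entries."""
    n = len(h)
    return all(abs(h[i][j] - complex(h[j][i]).conjugate()) < tol
               for i in range(n) for j in range(n))


# Example: at the K point f(k) vanishes, leaving only g1, delta and u1, u2.
K_X = 4 * math.pi / (3 * math.sqrt(3) * A_CC)
H_K = bilayer_hamiltonian(K_X, 0.0, u1=10.0, u2=-10.0,  # u1, u2 hypothetical
                          g0=2610.0, g1=361.0, g3=283.0, g4=138.0,
                          delta=-15.0)  # parameters as printed in equation (2)
```

Checking Hermiticity with is_hermitian is a quick way to catch a misplaced conjugate when transcribing the matrix.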

In the numerical simulation, we use phenomenological parameters


$$
(\gamma_0, \gamma_1, \gamma_3, \gamma_4, \Delta) = (2{,}610,\ 361,\ 283,\ 138,\ -15)\ \text{meV} \qquad (2)
$$


obtained from ref.**. Compared with TBG, TDBG has additional parameters
γ₁ and γ₃ (trigonal warping), γ₄ (particle-hole asymmetry) and Δ,
in addition to γ₀ (nearest-neighbour hopping). Here, γ₁ and Δ are the
interlayer hopping and the on-site energy, respectively, at A-B stacked sites,
where the A site of the first layer (A1) sits on top of the B site of the second
layer (B2). Although these parameters are much smaller
than γ₀, they are important to understand the experimental data. In
particular, for vanishing U, a finite value of γ₃ yields a larger bandwidth
and overlap between c₁ and v₁. This is why the system is metallic at the
CNP and there is no magic-angle condition at D = 0. Furthermore, γ₄
and Δ give rise to the electron-hole asymmetry. Owing to these terms,
the bandwidth of v₁ is much larger than that of c₁, resulting in smaller
band isolation for v₁ (Extended Data Fig. 1). For the moiré hopping
parameters, (w₀, w₁) = (0.08, 0.1) eV is used to account for the relaxation
effect described by ref.°°. The relaxation increases the gap between c₁
and c₂ (v₁ and v₂), stabilizing the insulating states for the range of D at
n = ±nₛ fillings. For more details, see ref. ”.

Extended Data Fig. 1c, d shows a direct comparison between exper- 
imental resistivity and the calculated density of states at the Fermi 
energy for the θ = 1.33° TDBG. The experimental results are plotted
against displacement field D while the calculation is plotted against the
onsite potential difference U between the top-most and bottom-most
graphene layers. The conversion between the experimental parameter
D and the calculation parameter U is not straightforward owing to
the screening of the electric field by the graphene layers themselves.
Thus, converting D to U requires a self-consistent calculation of the
screening effect produced by the TDBG band structure, which in turn
depends on U. However, in Extended Data Fig. 1, we see a very good
match between experimental single-particle insulating states and
theoretical gaps at the Fermi energy when we convert D into U with
an empirical factor: U = 0.1 nm × eD, where e is the electron charge.
Besides the single-particle gaps, the calculation also shows regions 
of high density of states. In particular, in experimental data, there are 
two lines with higher resistivity than the surroundings: from (n, D) = 
(-1, -0.2) to (1, 0.6) and from (n, D) = (-1, 0.2) to (1, -0.6). These two 
lines form a cross and pass through the half-filled insulator as well as 
the halo features. By comparing with the calculation, we recognize that 
these experimental features correspond to the regions of high density 
of states shown in Extended Data Fig. 1d. 
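
The empirical conversion quoted above is simple enough to write down explicitly. The sketch below assumes that D is expressed in V nm⁻¹ (the unit is not stated in this section) and returns U in meV; the function name and the default length argument are hypothetical.

```python
def displacement_to_onsite_potential_mev(d_v_per_nm, length_nm=0.1):
    """Empirical conversion U = length * e * D, returned in meV.

    With D in V/nm and the length in nm, e * D * length is an energy in eV
    that is numerically equal to D * length, hence the factor of 1000.
    This is the empirical factor from the text, not a self-consistent
    screening calculation.
    """
    return 1000.0 * length_nm * d_v_per_nm


# Under the stated unit assumption, D = 0.4 V/nm maps to U of about 40 meV.
u_example = displacement_to_onsite_potential_mev(0.4)
```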


Hall effect and ferromagnetism phase boundary 

In Extended Data Fig. 2, we show the Hall effect measured in the sample
with the most robust half-filled insulating states (θ = 1.41°,
Δ_{nₛ/2} = 4.2 meV). There is a clear change in Hall resistance behaviour
across the halo boundary, which is identifiable in the longitudinal resistance
measurement (Extended Data Fig. 2a). Inside the halo, the Hall
resistance changes sign across the half-filled insulating state, with a 
positive value above half-filling and a negative value below half-filling. 
This Hall measurement demonstrates that the metallic states inside 
the halo are closely related to the half-filled insulator, probably by a 
change of Fermi level from half-filling to the inside of the subband 
(Extended Data Fig. 2c). Given that the half-filled insulator is spin polar- 
ized, the metallic states are probably a ferromagnetic metal, which 
contains two spin-polarized bands that are shifted in energy by the 
ferromagnetic exchange coupling ((2) in Extended Data Fig. 2c). Outside 
the halo, the Hall effect follows the expectation of a single-particle 
moiré band without correlation effects such as in large-angle TBG, 
indicating that the system recovers to a normal metallic state.

In Extended Data Fig. 2a, we also notice that a three-quarter-filled 
insulating state appears on the border of the halo. The simplest possible
candidate of this state is a spin- and valley-polarized state. However, 
within a simple mean-field picture, we expect the lowest-lying excita- 
tions in this state to be associated with valley-flip rather than spin-flip 
as the spin exchange coupling is expected to be larger than the valley 
exchange splitting”®. This naive picture appears to be inconsistent with 
the enhancement of the gap by the in-plane magnetic field shown in 
Fig. 2d in the main text. In addition, the appearance of the quarter-filling 
state right at the edge of the halo suggests that a mean-field picture 
may fail to capture the relevant physics. We leave this question regard- 
ing the nature of the quarter-filling state to future theoretical works. 


Magnetic-field-induced Chern insulator state in the 1.26° 
sample 

In the θ = 1.26° sample, we observed distinctly different behaviour of
the Hall resistance. Under a small perpendicular magnetic field, the Hall
resistance is always positive inside the halo (Extended Data Fig. 3b),
rather than changing sign across half-filling. The absence of a sign change
of the Hall signal across half-filling may be due to thermally excited carriers
of both types as a result of the small insulating gap. It could also
be due to the Chern insulator behaviour discussed below. Measuring
the Hall resistance with changing magnetic field and density at a fixed
displacement field (Extended Data Fig. 3d) reveals a single line of large
Hall signal tracing to half-filling with a slope corresponding to ν = 4. At
the same time, longitudinal resistance develops a minimum along the
same line (Extended Data Fig. 3c). Following the ν = 4 line (black guiding
lines in Extended Data Fig. 3c, d), Extended Data Fig. 3e shows that the
Hall resistance reaches close to a quantized value of h/4e² when the
perpendicular magnetic field B⊥ > 3 T.
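
For orientation, the quantized value and the slope of the fan line can be evaluated numerically. The sketch below uses the exact SI values of h and e and the Streda relation n = n₀ + νeB/h; the function names and the offset argument n0_cm2 (the density from which the fan emanates) are hypothetical.

```python
H_PLANCK = 6.62607015e-34   # Planck constant, J s (exact SI value)
E_CHARGE = 1.602176634e-19  # elementary charge, C (exact SI value)


def quantized_hall_resistance(nu):
    """Hall resistance h / (nu e^2) in ohms for filling factor nu."""
    return H_PLANCK / (nu * E_CHARGE ** 2)


def density_on_fan_line(nu, b_tesla, n0_cm2=0.0):
    """Carrier density (cm^-2) on the line n = n0 + nu * e * B / h."""
    per_m2 = nu * E_CHARGE * b_tesla / H_PLANCK
    return n0_cm2 + per_m2 * 1e-4  # convert m^-2 to cm^-2


r_nu4 = quantized_hall_resistance(4)    # about 6.45 kOhm
dn_3t = density_on_fan_line(4, 3.0)     # about 2.9e11 cm^-2 away from n0
```

With ν = 4 the plateau value h/4e² is close to 6.45 kΩ, and the line moves by roughly 1 × 10¹¹ cm⁻² per tesla.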

The fact that a single Hall plateau emerges from half-filling strongly 
suggests that this is not a normal quantum Hall state. Instead, the data
highly resemble the Chern insulator shown in MA-TBG”. Indeed, our
theory predicts that in TDBG, the first conduction band has a Chern
number C = 2 in one valley and an opposite Chern number C = −2 in the
other valley”’. As shown in the main text, without a perpendicular magnetic
field, the half-filled state is spin polarized and valley unpolarized,
giving a total Chern number of 0. However, a perpendicular magnetic
field couples to the valley degree of freedom through the orbital valley
Zeeman effect. When the spin-polarized gap is small, such as in the
1.26° device, the valley Zeeman energy can overcome the spontaneous
spin-polarized gap and convert the spin-polarized half-filled insulating
state into a valley-polarized Chern insulator. Using the valley Zeeman
factor from a scanning tunnelling microscopy study and calculation”,
we estimate that the valley Zeeman energy surpasses the 0.3 meV gap at a
perpendicular field of 0.2 T. This valley-polarized half-filled state fills
two moiré bands (of spin up and spin down) in one valley, adding up
to a total Chern number of four.
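
The Chern-number counting in this argument can be written as a short bookkeeping exercise. The sketch below assumes, as the text implies, that both spin species of the first conduction band carry the same valley Chern number; the dictionary and function names are hypothetical.

```python
# Chern number of the first conduction band in each valley (from the text);
# assumed here to be identical for spin up and spin down.
VALLEY_CHERN = {"K": 2, "K'": -2}


def total_chern(filled_flavours):
    """Sum the Chern numbers over the filled (valley, spin) flavours."""
    return sum(VALLEY_CHERN[valley] for valley, _spin in filled_flavours)


# Half-filling means two of the four flavours are occupied.
spin_polarized = [("K", "up"), ("K'", "up")]      # zero-field ground state
valley_polarized = [("K", "up"), ("K", "down")]   # field-induced state

c_spin = total_chern(spin_polarized)      # contributions cancel
c_valley = total_chern(valley_polarized)  # contributions add
```

The spin-polarized filling gives a total of 0, whereas the valley-polarized filling gives 4, matching the ν = 4 slope of the observed Hall plateau.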


Critical transition behaviours in the θ = 1.26° sample

In Extended Data Fig. 4c, the dome of the superconducting-like state, 
similar to MA-TBG, can be seen next to the half-filled insulator. In addition,
cutting through a constant density line, a similar dome structure
is visible over the displacement field axis (Extended Data Fig. 4b).
The dome in the displacement field terminates on the boundary of
the halo. It may first appear that the low-resistance state outside the
halo boundary resembles a superconductor as well. However, as we
discussed in the main text, there is no critical transition outside the
halo. In addition, the I–V characteristic outside the halo is very different
from that inside the halo. Within the halo, differential resistance
shows a critical current that reduces to zero when approaching the
halo boundary (Extended Data Fig. 4f). Outside the halo, in contrast,
the I–V characteristic does not fit that of a superconductor (Extended
Data Fig. 4g). We believe the low resistance outside the halo is purely
caused by ballistic transport. Extended Data Fig. 4d shows that the
superconducting-like state has a critical perpendicular magnetic field
of about 0.1 T.

In a recent study”, He et al. observed a sudden drop of resistance
with a residual resistance of about 1 kΩ, similar to our 1.32° device. They
also reported a nonlinear I–V curve, where dV/dI(I) gradually increases
with current in a parabolic manner up to a factor of two without signs
of a critical current. This observation is in stark contrast to the data from
our superconducting-like sample, where dV/dI reaches zero at zero bias
and saturates to a finite value above the critical current. In their paper,
He et al. explain this nonlinear I–V as an effect of the temperature
increase due to bias current heating. While this argument can explain
the nonlinear I–V observed in their sample with large residual resistance,
we demonstrate that our superconductor-like nonlinear I–V is
unlikely to be caused by heating.

At 2 K, the zero-bias resistivity of the sample is close to zero. If we
assume the sample is metallic, we can translate resistivity to thermal
conductivity. Taking the upper bound of resistivity to be 50 Ω, it converts
to 1 nW K⁻¹ in heat conductivity according to the Wiedemann-Franz
law at 2 K. From the I–V curve shown in Fig. 3c, we can extract
the heating power to be about 13 pW at the critical current of about
300 nA. As a result, the temperature increase is about 13 mK. In contrast,
it requires a 5 K temperature increase to bring the sample resistance
to the normal value (Fig. 3b). This provides additional confirmation
that our observed nonlinear I–V is probably an intrinsic property of
the device rather than a heating effect.
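
The heating estimate above can be reproduced in a few lines. The sketch below assumes the Sommerfeld value of the Lorenz number and treats the 50 Ω upper bound as the electrical resistance entering the Wiedemann-Franz law; the function names are hypothetical.

```python
import math

K_B = 1.380649e-23          # Boltzmann constant, J/K
E_CHARGE = 1.602176634e-19  # elementary charge, C

# Sommerfeld Lorenz number L = (pi^2 / 3) (k_B / e)^2, about 2.44e-8 W Ohm K^-2
LORENZ = math.pi ** 2 / 3 * (K_B / E_CHARGE) ** 2


def thermal_conductance(resistance_ohm, temperature_k):
    """Wiedemann-Franz thermal conductance G_th = L * T / R in W/K."""
    return LORENZ * temperature_k / resistance_ohm


def temperature_rise(power_w, resistance_ohm, temperature_k):
    """Steady-state temperature rise dT = P / G_th in kelvin."""
    return power_w / thermal_conductance(resistance_ohm, temperature_k)


g_th = thermal_conductance(50.0, 2.0)       # close to 1 nW/K
d_t = temperature_rise(13e-12, 50.0, 2.0)   # close to 13 mK
```

Because 50 Ω is an upper bound on the resistance, the resulting temperature rise is an upper bound as well, and it remains far below the 5 K needed to reach the normal-state resistance.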


Enhancement of the transition temperature with B∥

If the superconducting-like behaviour in the 1.26° sample is indeed
from superconductivity, the parallel field dependence shown below
suggests that it might be an exotic spin-polarized superconductor.
Here we investigate the behaviour of ρ(T) as a function of B∥. Extended
Data Fig. 5a shows a superconducting dome in the (n, B∥) plane with a
maximum critical parallel magnetic field B∥c ~ 1 T. The salient experimental
feature is the B∥ dependence of the superconducting state below
the critical field B∥c. Extended Data Fig. 5b shows ρ at optimal density
and displacement field (n_op, D_op) as a function of T and B∥. In this optimal
superconducting state, ρ vanishes critically as T and B∥ decrease. We
use a phenomenological definition of the critical temperature T_50%,
defined as the 50% transition point. Interestingly, T_50%(B∥) follows a
non-monotonic behaviour. In particular, T_50% increases as B∥ increases
from 0 to about 0.3 T before it decreases for B∥ > 0.3 T. We also performed
I–V characterization at the optimal gate configuration (n_op, D_op)
as a function of B∥ and T to obtain T_BKT (Extended Data Fig. 4e). Similar
to T_50% above, T_BKT(B∥) also shows a non-monotonic behaviour, as shown
in Extended Data Fig. 5b (black circles). These sets of evidence suggest
that a small B∥ can strengthen the superconductivity.


The increase of critical temperature with B∥ suggests that the Cooper
pairs responsible for the superconductivity here, if confirmed, are
likely to be spin polarized. One possible scenario for such a state is illustrated
in Extended Data Fig. 5d, where the Cooper pairs form between
Fermi surfaces with the same spin (spin-triplet) and opposite valleys.
This model is consistent with our previous discussion of a ferromagnetic
metal parent state inside the halo next to the half-filled insulator,
where the Fermi surfaces of two different spins have different filling
status. In this spin-polarized pairing scheme, a parallel magnetic field
enlarges the majority spin Fermi surface and strengthens the superconductivity,
inducing the change in the critical temperature ΔT_c ∝ B (ref.”’).
The eventual destruction of superconductivity at high magnetic fields
can result from the following mechanism. Magnetic flux in between
layers leads to a momentum shift, which has an opposite sign in the two
valleys, thereby bringing the two pairing Fermi surfaces out of alignment.
The latter effect is expected to reduce the critical temperature,
ΔT_c ∝ −B² (ref. 7°). Alternatively, if the ferromagnetic pairing is caused
by spin fluctuations, as suggested in the heavy fermion metals*’ *’, a
strong parallel magnetic field can suppress the superconductivity by
suppressing spin fluctuations”.
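
The competition between a linear gain and a quadratic loss can be made concrete with a toy model. The sketch below assumes ΔT_c(B) = aB − cB²; the coefficients a and c are hypothetical and are chosen only so that the maximum falls at the roughly 0.3 T quoted above. They are not fitted values, and the actual field dependence need not follow this form.

```python
def delta_tc(b, a, c):
    """Toy model: linear gain a*B minus quadratic loss c*B^2."""
    return a * b - c * b ** 2


def field_of_maximum(a, c):
    """Field where d(delta_tc)/dB = a - 2*c*B vanishes."""
    return a / (2 * c)


def field_of_zero_gain(a, c):
    """Non-zero field where delta_tc returns to its zero-field value."""
    return a / c


# Hypothetical coefficients (arbitrary units) placing the maximum at 0.3 T.
A_COEFF = 1.0
C_COEFF = A_COEFF / (2 * 0.3)
b_peak = field_of_maximum(A_COEFF, C_COEFF)
```

In this form the enhancement is lost again at twice the field of the maximum, which gives a simple consistency check against a measured T_c(B∥) curve.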

Meanwhile, in the measurement shown in Extended Data Fig. 5c, we
notice that the 0 T resistivity goes slightly negative (about −30 Ω) at the
lowest temperature for this specific thermal cycle. In general, we find
that the four-terminal resistivity we measure in the 1.26° device sometimes
shows a small residual value (−50 to 50 Ω) that varies between different
thermal cycles. The residual resistance is not present in the measurement
shown in Fig. 3b. This residual resistance (when it is present in a
specific thermal cycle) is sensitive to the measurement configuration.
Extended Data Fig. 6 shows two different four-terminal resistance measurement
configurations. Between the two configurations (blue curve
and red curve), we essentially flip the direction of the current. As we use
a low-frequency a.c. (about 17 Hz) for the current source and a lock-in
amplifier for the voltage probe, in an ideal condition, we expect to
obtain the same signal with opposite polarity. However, we find that the
measured signals deviate from this expectation when there is residual
resistance at low temperature. Specifically, when flowing current from
top to bottom (blue curve in Extended Data Fig. 6), the four-terminal resistance
is positive, and a positive residual resistance of about 50 Ω remains
at the lowest temperature, 2 K. However, when the current is flowing
from the bottom to the top (red curve), the four-terminal resistance at
higher temperature is negative as expected, but the residual resistance
at lower temperature is still positive and is nearly identical to that in
the blue curve (see inset).

We believe that the observed anomaly in the residual resistance origi- 
nates from bias-induced gating in combination with thermoelectric 
voltages present in our cryostat wiring. Owing to the temperature gra- 
dient in the cryostat, a d.c. thermoelectric voltage is always present 
between different pairs of wires. This d.c. voltage is simply added to 
the voltage probes on top of the a.c. voltage induced by bias current. 
In such cases, the a.c. bias voltage on the sample (about half of the bias
voltage on the source lead, as the drain is grounded) can modulate the
d.c. thermoelectric voltage through a bias-induced gating effect (see
ref.“ for a similar effect observed in a drag experiment), resulting in a
voltage signal synchronized with the applied a.c. bias current. We can
eliminate this bias-induced a.c. gating effect of thermoelectric voltage 
by subtracting the blue curve from the red curve, where we obtain a 
near-zero residual resistance within the noise level. During the thermal 
cycle in obtaining the data in Extended Data Fig. 5c, we did not collect 
the data in two different configurations, and thus such correction was 
not possible. Note that this a.c. gating effect becomes appreciable only 
when the device resistivity is very small (<50 Ω).
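
The correction described above amounts to separating the part of the signal that reverses with the current direction from the part that does not. The sketch below shows this on synthetic numbers; the function name and the example traces are hypothetical and are not measured data.

```python
def split_reversal_pair(forward, reversed_):
    """Split two traces taken with source and drain swapped.

    Returns (odd, even): the odd part changes sign with the current
    direction (the genuine four-terminal resistance), while the even
    part is common to both traces (the gating artefact).
    """
    odd = [(f - r) / 2 for f, r in zip(forward, reversed_)]
    even = [(f + r) / 2 for f, r in zip(forward, reversed_)]
    return odd, even


# Synthetic example: true resistance 0, 0, 200, 400 Ohm with a 50 Ohm artefact.
blue = [50.0, 50.0, 250.0, 450.0]     # current from top to bottom
red = [50.0, 50.0, -150.0, -350.0]    # current from bottom to top
resistance, artefact = split_reversal_pair(blue, red)
```

In this synthetic case the odd part recovers the assumed resistance with zero residual at low temperature, and the even part isolates the 50 Ω offset.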


Landau fan diagram 
In the 1.33° device under a perpendicular magnetic field, clear fans can be
identified coming from the CNP, full fillings ±nₛ (about ±4.1 × 10¹² cm⁻²)
as well as half-filling nₛ/2 on the electron-doped side (Extended Data
Fig. 7a). The Landau fan from the CNP shows a well-developed quantum
Hall sequence with fourfold degeneracy on the valence band
side under low magnetic fields, and subsequently develops the full
degeneracy-lifted quantum Hall states under higher magnetic fields.
The Landau fans on the conduction band side (n > 0) are highly unusual.
The fan from the CNP shows a sequence of only odd filling fractions ν = 3
and 5. Although the fan coming from the correlated insulating state at
half-filling shows a degeneracy of two, consistent with the picture of
a spin-polarized half-filled band, the sequence is also of odd numbers,
ν = 3, 5 and 7. The odd-integer sequenced quantum Hall effect from the
half-filling gap may relate to a Berry phase effect on the Landau level,
similar to the quantum Hall effect in monolayer graphene. Theory”?
predicts that the isolated flat conduction band (c₁) in TDBG carries
non-trivial Berry curvature and a non-zero Chern number. It is then possible
that the Berry curvature accumulates to a π or 3π phase on certain
Landau orbits, resulting in a quantum Hall filling fraction sequence of
2(N − 1/2) or 2(N − 3/2), with integer N.
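
To see why either phase yields only odd filling fractions, the two sequences can be enumerated directly. The sketch below assumes a twofold degeneracy and a Landau index N starting from one; the function name and the choice to discard non-positive values are hypothetical.

```python
def filling_sequence(berry_phase_in_pi, n_max, degeneracy=2):
    """List nu = degeneracy * (N - phase / (2 pi)) for N = 1..n_max, nu > 0.

    berry_phase_in_pi is the accumulated phase in units of pi, so a value
    of 1 gives 2(N - 1/2) and a value of 3 gives 2(N - 3/2).
    """
    shift = berry_phase_in_pi / 2
    return [int(degeneracy * (n - shift))
            for n in range(1, n_max + 1) if n - shift > 0]


seq_pi = filling_sequence(1, 4)    # 2(N - 1/2)
seq_3pi = filling_sequence(3, 4)   # 2(N - 3/2)
```

Both choices generate the same set of odd integers, offset by one Landau index, so the observed ν = 3, 5 and 7 alone do not distinguish between them.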

There is also a single quantum Hall state ν = 3 projected down
to the quarter-filled conduction band. This state could be a
magnetic-field-induced Chern insulator, similar to the ν = 4 Chern
insulator in the 1.26° device discussed above. As this fan projects down
to quarter-filling, it is likely that both spin and valley are polarized. 
We note that the valley Chern number corresponding to this spe- 
cific state is C = 3, although C = 2 is generally expected”’. Further 
study is required to clarify these experimental observations. Above 
a perpendicular magnetic field of 7 T, the half-filled insulator dis- 
appears, presumably due to the orbital effect of the perpendicular 
magnetic field. 

The fan diagram of the 1.26° device is also intriguing. Besides the 
Landau fans, we point out there is an obvious oscillation of longitudinal
resistance around n ≈ −nₛ/2 that does not sensitively depend on
doping (horizontal features marked in Extended Data Fig. 7b). These
oscillations are not from quantum Hall states, but from Hofstadter’s
butterfly. They are a result of the interplay between two different periodicities
in the system: moiré superlattice and magnetic length. When
one moiré unit cell contains 1/N of a magnetic flux quantum (N is an integer), the
two length scales become commensurate and produce a minimum in
resistivity. These features are indicative of a highly uniform twisting
angle distribution in the sample. We use these features to determine 
the twist angle in this device. 


Effective mass of the θ = 1.33° and θ = 1.26° samples

We calculate the effective cyclotron mass from temperature-dependent 
magnetoresistance (Shubnikov–de Haas (SdH)) oscillations. The cyclotron
mass is a measure of the density of states and thus directly related
to the Landau-level separation (cyclotron gap) under a given magnetic
field. As temperature increases, the SdH oscillation amplitude
is reduced following ΔR ∝ χ/sinh(χ), where χ = 2π²k_BT/(ħω_c) = 2π²k_BTm*/(ħeB).

For the θ = 1.33° sample, we measured SdH oscillations at all densities
between filling factor n/nₛ = −1 and 1, at T = 0.3, 2, 3, 4, 6, 9 and 14 K
(example: Extended Data Fig. 8a-c). We then extracted the oscillation
amplitudes and plotted them as a function of T/B. Fitting ΔR(T/B) with
the above formula with m* being the only fitting parameter, we obtained
the effective cyclotron mass shown in the main text. Similarly, we measured
SdH oscillations for the θ = 1.26° sample and extracted an effective
mass m* = 0.23mₑ, as shown in Extended Data Fig. 8e, f.
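
A minimal sketch of such a fit is given below, assuming the standard Lifshitz-Kosevich thermal factor χ/sinh(χ) with χ = 2π²k_BTm*/(ħeB). For simplicity it scans a grid of trial masses and, unlike the single-parameter fit described above, also fits an overall amplitude analytically; the function names, the grid and the synthetic data are hypothetical.

```python
import math

HBAR = 1.054571817e-34      # reduced Planck constant, J s
K_B = 1.380649e-23          # Boltzmann constant, J/K
E_CHARGE = 1.602176634e-19  # elementary charge, C
M_E = 9.1093837015e-31      # electron mass, kg


def lk_factor(t_over_b, m_ratio):
    """Thermal damping chi / sinh(chi); t_over_b in K/T, m_ratio = m*/m_e."""
    chi = (2 * math.pi ** 2 * K_B * m_ratio * M_E * t_over_b
           / (HBAR * E_CHARGE))
    return 1.0 if chi == 0 else chi / math.sinh(chi)


def fit_mass(t_over_b, amplitudes, masses):
    """Return the trial m*/m_e with the smallest squared error.

    For each trial mass the overall amplitude is obtained by linear
    least squares, so only the mass is scanned.
    """
    best = None
    for m in masses:
        model = [lk_factor(x, m) for x in t_over_b]
        scale = (sum(a * f for a, f in zip(amplitudes, model))
                 / sum(f * f for f in model))
        err = sum((a - scale * f) ** 2 for a, f in zip(amplitudes, model))
        if best is None or err < best[0]:
            best = (err, m)
    return best[1]


# Synthetic amplitudes generated with m* = 0.23 m_e (not measured data).
xs = [0.1, 0.3, 0.5, 1.0, 2.0]
amps = [5.0 * lk_factor(x, 0.23) for x in xs]
grid = [0.01 * i for i in range(1, 51)]
m_fit = fit_mass(xs, amps, grid)
```

A grid scan avoids any dependence on an optimization library; a nonlinear least-squares routine would serve the same purpose on real data.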


Device fabrication and characterization 

All the devices presented in this study were prepared using the dry 
transfer method”, using stamps consisting of polypropylene carbon- 
ate film and polydimethylsiloxane. Half of a bilayer graphene flake was 
torn and picked up by a stack of graphite/hBN on the transfer stamp. 
Then the remaining bilayer graphene flake on the substrate was rotated 
by the desired angle and picked up. The stacks were deposited on a
300-nm SiO₂/Si substrate after picking up the rest of hBN and graphite
layers. Part of the bilayer graphene flakes was extended outside the
hBN area onto polypropylene carbonate to prevent the graphene from
freely rotating on the hBN. The resulting stacks were fabricated into
1–2-μm-wide devices to ensure a uniform twist angle in the relatively
narrow channel. The temperature of the stack was always kept below 
180 °C during the stacking and fabrication processes. 

We measured a total of seven devices with different twist angles in 
this study. All samples were encapsulated by the hBN layers. In the 1.32°,
1.33°, 1.41°, 1.53° and 2.00° devices, both the top and bottom gates were
made from graphite. The 1.26° device utilized a graphite top gate and a
heavily doped silicon bottom gate below the hBN substrate and 300-nm
SiO₂ dielectric. The 1.48° device used a silicon bottom gate and a metal
top gate. Most of the devices were fabricated into Hall bars with the 
exception of the 1.26° and 1.41° samples, which were fabricated into 
a Van der Pauw geometry. The gate configurations and device images 
are shown in Extended Data Fig. 9. 

The resistivity presented here was measured at 17.7 Hz using the 
standard lock-in technique, with a 0.5-1.0 mV voltage bias and a 
current-limiting resistor of 100 kΩ connected in series with the
sample, which limits the current in the sample to an upper bound of 
5-10 nA. This bias scheme is to ensure neither the voltage nor the cur- 
rent becomes too large when sweeping across states with drastically 
different resistance (insulators or superconducting-like states). The 
four-terminal voltage and the source-drain current are measured 
simultaneously with two lock-in amplifiers to obtain the four-terminal 
resistance. Resistivity is then obtained by multiplying the resistance 
by a geometric factor (about 4.5 for Van der Pauw devices).
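
The bias scheme can be checked with a simple series-circuit estimate. The sketch below assumes an ideal voltage source, the 100 kΩ limiting resistor and a purely resistive sample; it also evaluates π/ln 2, the ideal factor for a symmetric Van der Pauw geometry, which we assume is the origin of the value of about 4.5. The function names are hypothetical.

```python
import math

R_LIMIT = 100e3  # current-limiting resistor in ohms (from the text)


def sample_current(v_bias, r_sample, r_limit=R_LIMIT):
    """Current through the sample for a series voltage divider, in A."""
    return v_bias / (r_limit + r_sample)


def sample_voltage(v_bias, r_sample, r_limit=R_LIMIT):
    """Voltage dropped across the sample, in V."""
    return v_bias * r_sample / (r_limit + r_sample)


# Ideal Van der Pauw factor for a symmetric sample, about 4.53.
VDP_FACTOR = math.pi / math.log(2)

i_max_low = sample_current(0.5e-3, 0.0)   # 5 nA for a vanishing sample resistance
i_max_high = sample_current(1.0e-3, 0.0)  # 10 nA
```

For a highly resistive state the current drops well below this bound while the voltage across the sample saturates at the bias voltage, so neither quantity becomes large.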

Figure 1f and Extended Data Fig. 9a, c-g show large-scale gate scans 
of the longitudinal resistivity in all samples. These samples show insu- 
lating states at full filling under the zero displacement field and at the 
CNP under a large displacement field. In particular, the
1.32°, 1.48°, 1.53° and 2.00° devices show a CNP gap under zero
displacement field, which closes and reopens with increasing displacement
field. For the 1.32° device, we note that the gate scan diagram
shows a wide insulating region at full filling and double peaks at
half-filling, suggesting that there is additional moiré periodicity in the
channel other than 1.32°. The CNP gap in the 1.32° device under zero
displacement field probably originates from the larger twist angle
region of the sample, as none of the neighbouring angle devices (1.26°,
1.33° and 1.41°) has a gap at the CNP under zero displacement field. We
remark that theory” predicts that a gap at the CNP opens up even at zero
displacement field when θ > 1.5°, qualitatively agreeing with our experimental
observation.

The twist angles are estimated from two independent methods. For 
the first method, we measure the gate voltages of full-filling gaps and 
convert these voltages to full-filling density n, using the gate capaci- 
tance. The gate capacitance used for this conversion is calibrated by 
Landau fan diagrams. We then calculate the twist angle from the fact 
that full filling corresponds to four electrons per moiré unit cell, so the
moiré unit cell area is A = 4/nₛ. The main source of errors for this method
is in locating the exact position of full filling in the gate voltage. We
also use the radiating Landau fan coming from full filling to help locate
its exact position. Typically, we can identify the position of full filling
to an accuracy of δnₛ = 10¹¹ cm⁻², corresponding to a twist angle error
of ±0.02°. The second method exploits Hofstadter’s butterfly features
under magnetic fields. Here we find carrier-density-independent oscillations
of the longitudinal resistance R_xx under perpendicular magnetic
fields, with the minimum of R_xx appearing when the magnetic flux in a
moiré unit cell is a fraction of the flux quantum, BA = Φ₀/N (Extended
Data Fig. 7b), where B is magnetic field, Φ₀ is the flux quantum and N
is an integer. These features are used to calculate the twist angle with 
even better accuracy (±0.01°).
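
Both estimates reduce to converting a moiré unit cell area into a twist angle. The sketch below assumes a triangular moiré lattice with A = (√3/2)λ² and λ = a_g/(2 sin(θ/2)), a graphene lattice constant a_g = 0.246 nm and a flux quantum Φ₀ = h/e; these relations are standard but are not spelled out in the text, and the function names are hypothetical.

```python
import math

A_GRAPHENE_NM = 0.246           # graphene lattice constant in nm (assumed)
FLUX_QUANTUM = 4.135667696e-15  # h/e in Wb (assumed definition of the flux quantum)


def twist_angle_from_area(area_nm2):
    """Twist angle in degrees from the moire unit cell area in nm^2."""
    wavelength = math.sqrt(2 * area_nm2 / math.sqrt(3))
    return math.degrees(2 * math.asin(A_GRAPHENE_NM / (2 * wavelength)))


def twist_angle_from_full_filling(n_s_cm2):
    """Method 1: A = 4 / n_s, with n_s in cm^-2 (1 cm^2 = 1e14 nm^2)."""
    area_nm2 = 4.0 / (n_s_cm2 * 1e-14)
    return twist_angle_from_area(area_nm2)


def twist_angle_from_hofstadter(b_tesla, n_int):
    """Method 2: B * A = flux quantum / N at a resistance minimum."""
    area_m2 = FLUX_QUANTUM / (n_int * b_tesla)
    return twist_angle_from_area(area_m2 * 1e18)  # m^2 to nm^2


theta_133 = twist_angle_from_full_filling(4.1e12)
theta_shift = twist_angle_from_full_filling(4.2e12) - theta_133
```

With nₛ = 4.1 × 10¹² cm⁻² the first method returns about 1.33°, and shifting nₛ by 10¹¹ cm⁻² moves the result by slightly less than 0.02°, in line with the error quoted above.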

In Extended Data Fig. 9a, we observe negative resistivity in a part 
of the 2D gate scan. These anomalous signals can be attributed to
non-transparent contacts in these gate regions. Comparing with
the 2D map of the two-terminal resistance measured in this device
(Extended Data Fig. 9b), we find that the gate regions where negative
resistivity was observed in Extended Data Fig. 9a in general correspond
to the gate regions where a large contact resistance is demonstrated 
by the two-terminal measurement. This strong correlation between 
the apparent ‘negative’ four-terminal resistivity and high contact 
resistance suggests that the negative four-terminal resistivity originates 
from inefficient contact equilibration in these gate regimes. Indeed, 
the contact transparency can be hindered by the unintended p-n junction
formation near the metal contacts when the applied gate voltages
conspire with the work function mismatch between graphene and 
metal to cause an accumulation of the opposite polarity of charges near 
the contact and in the channel. However, near the half-filled insulator 
region, where most of our research is focused, the two-terminal resistance
stays less than 10 kΩ, demonstrating excellent contact transparency.
Thus, we conclude that the contact anomaly can be excluded as
a possible cause of our experimental observation near the correlated 
insulator regime. We also notice that the correlated insulating state 
in the positive displacement field side (V,;, < 0 and V,, > 0) shows a 
much weaker signature than the negative displacement field side. The 
absence of a clear signature of correlated insulator on the opposite 
side of the displacement field is also likely due to the inefficient contact
equilibration in these particular gate configurations, as shown in
Extended Data Fig. 9b. 


Data availability 


The data that support the findings of this study are available from the 
corresponding author upon reasonable request. 
Extended Data Fig. 1 | Theoretical band structure of TDBG. a, Calculated
band structure of TDBG at zero displacement field and at the optimal displacement
field D for the isolated flat band. b, Calculated parameter space for the isolated
conduction band (x axis is the onsite potential difference U between the
top and bottom graphene layers, y axis is the twist angle). Colour represents the
bandwidth of the first conduction band c₁ (meV). In the coloured parameter
space, c₁ is isolated from the second conduction band and the first valence
bands. The two dotted lines represent cuts at θ = 1.26° and θ = 1.33°. c, Resistivity
as a function of filling fraction and displacement field in the θ = 1.33° sample.
A cross-like feature of high resistivity is formed along two lines from
(n, D) = (−1, −0.2) to (1, 0.6) and (−1, 0.2) to (1, −0.6), passing through the half-filled
insulating states. d, Density of states at the Fermi energy calculated by the
continuum model. The single-particle insulators (n/nₛ = 0, ±1) in experiment
match well with the gaps shown in the calculation and the van Hove singularity
captures the cross-like pattern in experiment.
Extended Data Fig. 2 | Hall effect in a device with robust half-filled
insulators. a, b, Longitudinal resistivity (a) and Hall resistance (b) of the
θ = 1.41° device around half-filling at T = 1.5 K and under perpendicular magnetic
field B⊥ = 1 T. Data are symmetrized between positive and negative fields to
eliminate mixing. The halo structure is apparent around the half-filled insulator
and a three-quarter-filled insulating state resides on the border of the halo.
The Hall resistance changes sign across the half-filled insulator inside the halo.
c, Illustration of electron orders for different regimes. The left half (right half) 
of the cartoon represents the band of spin down (up) electrons. For half-filling, 
only one species of spin is filled. Inside the halo, one spin species is populated 
more than the other. Outside the halo, both spins are equally populated. 
Extended Data Fig. 3 | Field-induced Chern insulator in the θ = 1.26° device.
a, b, Longitudinal resistivity at B = 0 (a) and Hall resistance at B⊥ = 0.5 T (b) in the
θ = 1.26° sample at T = 1.5 K. The Hall resistance here is symmetrized with both
directions of the magnetic field. c, d, Fan diagram of longitudinal resistivity (c)
and Hall resistance (d) at T = 1.5 K at a constant displacement field. The black
line marks the expected position for the ν = 4 Chern insulator state originating
from half-filling. e, Longitudinal resistivity and Hall resistance along the black
line shown in c and d.
Extended Data Fig. 4 | Critical behaviours in the θ = 1.26° device.
a, Resistivity in the 1.26° device plotted against filling factor and displacement
field. b, Resistivity as a function of displacement field and temperature along
the constant density line shown in a. c, Resistivity as a function of filling and
temperature along the tilted line in a. The dome of the low resistance state can
be seen next to the half-filled insulator. d, Resistivity on a log scale as a function
of filling and perpendicular magnetic field. e, The power α in V ∝ I^α as a function
of temperature from fitting the top left inset of Fig. 3c. The temperature at which
α = 3 is defined as the BKT transition temperature. f, Differential resistance as a
function of current and displacement field along the constant density line shown
in a. g, I–V curves
Extended Data Fig. 5 | Enhancement of the critical temperature under
in-plane magnetic field. a, Resistivity as a function of in-plane magnetic field
across the half-filled insulator and superconducting-like state in the 1.26°
device. b, Resistivity as a function of temperature and in-plane magnetic field
at optimal doping and displacement field. T_BKT denotes the BKT temperature
extracted from nonlinear I–V measurements. T_50% marks the temperature where
resistance is half of the normal resistance. c, Line traces at different in-plane
magnetic fields. d, Illustration of pairing in a spin-polarized superconductor.
The blue (red) surface represents the spin down (up) electron band. The two
bands are filled differently due to the parent ferromagnetic metallic state.
The hexagon represents the Brillouin zone of the graphene lattice. Pairing thus
happens between Fermi surfaces of the same spin and opposite valleys.
Extended Data Fig. 6 | Origin of a small residual resistivity at T < T_c. R(T) curve
measured in the superconducting regime of the 1.26° device in two measurement
configurations. The voltage probes are kept the same between the two configurations
while the source and the drain contacts are switched.
Extended Data Fig. 7 | Landau fan diagram as a function of filling fraction
and perpendicular magnetic field. a, The 1.33° device. The numbers next to
the guiding lines indicate Landau-level filling factors. b, The 1.26° device.
Horizontal lines highlight the Hofstadter’s butterfly features that occur when a
simple fraction of the flux quantum Φ_0/N (N is an integer) penetrates through a
moiré unit cell.
Extended Data Fig. 8 | Effective mass calculation for the 1.33° and 1.26°
devices. a-c, Temperature-dependent SdH oscillations in the θ = 1.33° device
at a few representative density points: n = -1.3 × 10^12 cm^-2 (a),
1.45 × 10^12 cm^-2 (b) and 2.65 × 10^12 cm^-2 (c). d, Extracted oscillation
amplitudes as a function of T/B for the density configuration shown in a-c and
corresponding fitting curves. e, Temperature-dependent SdH oscillations in the
θ = 1.26° device at n = 0.61 × 10^12 cm^-2, which is above half-filling and inside
the halo. f, Extracted oscillation amplitudes as a function of T/B in the
θ = 1.26° device.
Extended Data Fig. 9 | Device characterization. a, c-g, Device structure,
optical image and four-terminal resistivity map of each device: 1.26° (a),
1.32° (c), 1.41° (d), 1.48° (e), 1.53° (f) and 2.00° (g). For the 1.26° device, the
active device is the four-terminal Van der Pauw sample. The structure of each
device is depicted by the cross-section illustration on the left of the optical
image. b, Two-terminal resistance measured in the 1.26° device in the same gate
voltage range presented in a. Dashed square marks the active area studied.
h, Structure and optical image of the 1.33° device.


Extended Data Table 1 | Summary of all TDBG devices studied

θ                1.26°    1.32°    1.33°    1.41°    1.48°    1.53°    2.00°
w (meV)          12(11)   13       13(10)   17       21       24       71
Δ_1/2 (meV)      0.30     2.8      3.0      4.2      0.54     0.72     N.A.
ρ_halo (kΩ/□)    <0.01    1.04     0.82     0.21     0.28     0.14     N.A.

Minimum bandwidth w at the optimal displacement field obtained from continuum model
calculation (experimentally estimated bandwidth is shown in the bracket), half-filled
gap Δ_1/2 and resistivity well below the critical transition inside the halo ρ_halo,
for devices with different twist angles. There is no sign of any correlated state in
the 2.00° device (N.A.). Δ_1/2 shows a general trend of diminishing away from the
optimal angle 1.41°, although the disorder might cause some device-to-device variation.
A central challenge in developing quantum computers and long-range quantum
networks is the distribution of entanglement across many individually controllable
qubits1. Colour centres in diamond have emerged as leading solid-state ‘artificial
atom’ qubits2,3 because they enable on-demand remote entanglement4, coherent
control of over ten ancillae qubits with minute-long coherence times5 and
memory-enhanced quantum communication6. A critical next step is to integrate
large numbers of artificial atoms with photonic architectures to enable large-scale
quantum information processing systems. So far, these efforts have been stymied by
qubit inhomogeneities, low device yield and complex device requirements. Here we
introduce a process for the high-yield heterogeneous integration of ‘quantum
microchiplets’ (diamond waveguide arrays containing highly coherent colour
centres) on a photonic integrated circuit (PIC). We use this process to realize a
128-channel, defect-free array of germanium-vacancy and silicon-vacancy colour
centres in an aluminium nitride PIC. Photoluminescence spectroscopy reveals
long-term, stable and narrow average optical linewidths of 54 megahertz
(146 megahertz) for germanium-vacancy (silicon-vacancy) emitters, close
to the lifetime-limited linewidth of 32 megahertz (93 megahertz). We show that
inhomogeneities of individual colour centre optical transitions can be compensated
in situ by integrated tuning over 50 gigahertz without linewidth degradation. The
ability to assemble large numbers of nearly indistinguishable and tunable artificial
atoms into phase-stable PICs marks a key step towards multiplexed quantum
repeaters7,8 and general-purpose quantum processors9-12.


Artificial atom qubits in diamond combine spin clusters with
minute-scale coherence times and efficient spin-photon interfaces,
making them attractive for processing and distributing quantum
information. In particular, proposed quantum repeaters for
long-range, high-speed quantum networks will require hundreds or
more memory qubits, whereas error-corrected quantum computing
may require millions or more. However, a critical barrier to
large-scale quantum information processing is the low device yield
of functional qubit systems. Furthermore, although individual
spin-photon interfaces can now achieve excellent performance, the lack
of active chip-integrated photonic components and wafer-scale,
single-crystal diamond currently limit the scalability of monolithic
diamond quantum-information-processing architectures. A promising
method to alleviate these constraints is heterogeneous integration,
which is increasingly used in advanced microelectronics to assemble
separately fabricated subcomponents into a single, multifunctional
chip. Such hybrid fabrication has also recently been used to integrate
photonic integrated circuits (PICs) with quantum modules, including
quantum dot single-photon sources, superconducting nanowire
single-photon detectors (SNSPDs) and nitrogen-vacancy (NV) centre
diamond waveguides. However, these demonstrations assembled
components one by one, which presents a formidable scaling challenge,
and they did not provide for spectral alignment of artificial atoms.
Here we introduce a ‘quantum microchiplet’ (QMC) framework that
greatly improves the yield and accuracy of heterogeneously integrated
nanoscopic devices. Specifically, this assembly process enables the
construction of a 128-channel photonic integrated artificial atom chip
containing diamond quantum emitters with high coupling efficiencies,
optical coherences near the lifetime limit and tunable optical
frequencies to compensate for spectral inhomogeneities on chip.

Figure 1 illustrates the heterogeneous integration process. The
multichip module consists of a waveguide layer in single-crystal aluminium
nitride (AlN) for low-loss photonics, microchiplet sockets to optically
interface with separately fabricated diamond QMCs and electrical
layers for controlling colour centre transitions. This PIC platform is
compatible with additional components, such as on-chip electro-optic
modulators and SNSPDs, for photon switching and photon detection
in a quantum photonic chip.


"Research Laboratory of Electronics, MIT, Cambridge, MA, USA. *Sandia National Laboratories, Albuquerque, NM, USA. “Present address: University of California Berkeley, Berkeley, CA, USA. 
‘These authors contributed equally: Noel H. Wan, Tsung-Ju Lu. “e-mail: noelwan@mit.edu; tsungjul@mit.edu; englund@mit.edu 
Fig. 1 | Scalable integration of artificial atoms with photonics. The separate
fabrication of subcomponents before their final assembly maximizes the yield,
size and performance of the hybrid emitter-photonics chip. A pick-and-place
method transfers pre-screened QMCs from their parent diamond chip into a
socket containing efficient photonic interfaces, as well as electrical wires for
controlling colour centres.

The large optical transparency of the QMC and PIC materials makes
them compatible with a variety of quantum emitters. Here we consider
the negatively charged germanium-vacancy (GeV) and silicon-vacancy
(SiV) centres in diamond with zero-phonon line transitions at 602 nm
and 737 nm, respectively, because of their stable optical and spin
properties. The process begins with focused ion beam (FIB) implantation
of Ge+ and Si+ into a 1-μm-pitch square array in a single-crystal diamond
substrate, followed by high-temperature annealing (Methods). This
process generates spots of tightly localized GeV centres (depth of
about 74 nm, vertical straggle of about 12 nm and lateral full-width at
half-maximum (FWHM) distribution of about 40 nm) and SiV centres
(about 113 nm, about 19 nm and about 50 nm, respectively), which we
then located and mapped relative to prefabricated alignment markers
by photoluminescence microscopy. We fabricated the QMCs over the
emitter arrays using a combination of electron-beam lithography (EBL)
and quasi-isotropic etching. Figure 2a shows a scanning electron
microscope (SEM) image of various suspended chiplets containing
8- or 16-channel waveguide arrays connected by diamond ‘trusses’, as
seen in the close-up SEM images in Fig. 2b, c and Fig. 2g, respectively.
Structurally, much larger arrays are fabricable and integrable: we
successfully transferred QMCs with as many as 64 waveguide
components (Methods). Despite a misalignment between the FIB mask and
the QMC patterns, the photoluminescence scans showed that 39% of
the 8-channel QMCs are ‘defect free’ (that is, they have one or more
stable colour centre per waveguide) as shown in Fig. 2e (Methods). The
defect-free yield of the 16-channel QMCs was lower as these are more
susceptible to misalignment, so we did not use them in this study. With
improvements in FIB alignment and lithography, as well as targeted
fabrication over pre-localized single emitters, an even higher yield
should be possible in future work (Methods).

Fig. 2 | Fabrication and integration of QMC with integrated photonics.
a, SEM overview of the parent diamond chip containing over 500 microchiplets
for heterogeneous integration. b, A 16-channel QMC. c, An 8-channel QMC with
varying mechanical beam rigidity. d, Photoluminescence map of GeV centres
(bright spots) in a 16-channel QMC. e, Photoluminescence map of SiV centres
(bright spots) in a defect-free 8-channel QMC. f, An AlN-on-sapphire integrated
photonics module that interfaces with the diamond QMC placed in the chiplet
socket. g, Close-up SEM of the diamond QMC and AlN photonic interfaces.

Fig. 3 | Integrated quantum photonics with colour centres. a, Experimental
setup in a 4-K cryostat showing the input and output optical interfaces (1), (2)
and (3). b, Energy level and spectrum of a GeV centre, where g, e, Δ_g and Δ_e
denote the ground state, excited state, ground-state splitting and excited-state
splitting, respectively. Resonant excitation probed transition C, which is the
brightest and narrowest line. c, Optical image of sixteen QMC-populated
microchiplet sockets containing GeV or SiV centres. The ‘unsuccessful’
modules indicate failed QMC placements. Ch., channel. d, Autocorrelation
measurements of a single GeV in channel 41 under off-resonant 2-mW, 532-nm
excitation (left) and under resonant excitation at 602 nm (middle), and
autocorrelation measurement of a single SiV in channel 65 under resonant
excitation at 737 nm (right). e, Waveguide-coupled single photons from every
integrated GeV and SiV channel in the PIC. The error bars indicate fit
uncertainties at the 1 s.d. level.


Figure 2f shows one of 20 microchiplet sockets connecting 8 input
and 8 output waveguide arrays to an 8-channel QMC. We fabricated this
PIC on a wafer of single-crystal AlN on a sapphire substrate using EBL
and chlorine reactive ion etching (Methods). AlN on sapphire is a suitable
platform for linear and nonlinear quantum photonics because of its
large bandgap (about 6.2 eV), high material nonlinearities and low
narrowband background fluorescence in the spectrum (600-760 nm)
of GeV and SiV centres. Using piezo-controlled micromanipulators,
we transferred QMCs into the microchiplet sockets with a placement
success rate of 90%. The diamond waveguides (width 340 nm and height
200 nm) transfer light into the AlN waveguides (width 800 nm and
height 200 nm) through inverse tapered sections with a simulated
efficiency of 97% (98%) at a wavelength of 602 nm (737 nm) (Methods).
The SEM image of an assembled device in Fig. 2g shows a transverse
placement error of 38 ± 16 nm. For such typical errors, simulations
indicate a drop in coupling efficiency by 10% or 0.46 dB. We find that
the transfer of the QMCs is substantially easier than for individual
waveguides due to their rigidity and many alignment features. The
successful transfer of 16 defect-free chiplets results in a 128-channel
photonically integrated quantum emitter chip, as characterized below.

We performed experiments in a closed-cycle cryostat with a base
temperature below 4 K, as illustrated in Fig. 3a. The optical fibre labelled
(1) couples pump light (fluorescence) to (from) the QMC via the AlN
waveguides. A microscope objective also provides optical access to the
QMC, for example, to a colour centre (optical interface labelled (2)) or a
scattering site (labelled (3)). Figure 3b shows the energy level and
emission spectrum of a single GeV when pumped through (2) and collected
through (1). Off-resonant excitation using 532-nm light with off-chip
pump filtering in this configuration enables the rapid identification
of single emitters (indicated by a photon intensity autocorrelation
function g^(2)(0) < 0.5). The left panel in Fig. 3d shows a typical
photon antibunching (g^(2)(0) = 0.19(7)) from a single GeV centre (channel
41) pumped near saturation, without background or detector jitter
correction. Under the resonant excitation at 602 nm of transition C
(Fig. 3b) of the zero-phonon line (ZPL), the photon purity improves to
g^(2)(0) = 0.06(2) (middle panel in Fig. 3d). Similarly, in channel 65, we
measured antibunched photons with g^(2)(0) = 0.05(3) from a single SiV
centre under resonant excitation at 737 nm (right panel in Fig. 3d). In all
128 integrated waveguides, shown in Fig. 3c, we identified single GeV
and SiV emitters using top excitation (through (2)) and fibre-coupled
waveguide collection (through (1)). Their photon statistics are
summarized in Fig. 3e.

Fig. 4 | Defect-free arrays of optically coherent and efficient
waveguide-coupled emitters. a, PLE spectrum (FWHM linewidth
Γ = 37(3) MHz, indicated by the arrows) of a single GeV in channel 41 with all-fibre
excitation and detection of the phonon sideband (PSB) fluorescence routed
on-chip via (1). b, Excitation via (2) and fluorescence detection via (1). This
geometry allows GeV resonance fluorescence detection at least 18 dB above
background, without spectral, temporal or polarization filtering. c, In
transmission, a single GeV centre causes coherent extinction of ΔT/T = 38(9)%
(orange curve, Γ = 35(15) MHz). The red curve shows the PLE spectrum
(Γ = 40(5) MHz). d, PLE spectra of GeV centres in each waveguide (WG) of a
characteristic 8-channel GeV QMC, with a mean ± standard deviation linewidth
of Γ = 54 ± 24 MHz. e, PLE spectra of SiVs in an 8-channel SiV QMC, with
Γ = 146 ± 20 MHz. We interpret the two lines in channel 69 as PLE spectra from
two distinct SiV centres (g^(2)(0) = 0.69(7) under off-resonant excitation, not
shown).

Next we investigated the optical coherence of a GeV centre using
all-fibre spectroscopy. Figure 4a shows the photoluminescence excitation
(PLE) spectrum of the channel-41 GeV as we scanned a resonant
laser across its ZPL (transition C) with both excitation and detection
through the fibre interface (1). Despite the presence of another emitter
spectrally detuned by 50 GHz in the same waveguide, resonant excitation
allows the selective addressing and readout of single emitters.
The measured linewidth of Γ = Γ_0 + 2Γ_d = 37(3) MHz (values in
parentheses indicate one standard deviation throughout this work), where
Γ_d is the pure dephasing rate of the emitter, is near the lifetime limit
Γ_0 = 1/(2πτ) = 24(2) MHz, as obtained from the excited-state lifetime τ
(Methods).
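
The linewidth bookkeeping above can be sketched numerically. The following is a minimal illustration, assuming the relations Γ = Γ_0 + 2Γ_d and Γ_0 = 1/(2πτ) quoted in the text; the function names are hypothetical, and the lifetime used here is back-calculated from the quoted 24 MHz limit rather than taken from the measurement.

```python
import math


def lifetime_limited_linewidth_hz(lifetime_s: float) -> float:
    """Lifetime-limited FWHM linewidth, Gamma_0 = 1 / (2 * pi * tau)."""
    return 1.0 / (2.0 * math.pi * lifetime_s)


def pure_dephasing_rate_hz(measured_hz: float, limit_hz: float) -> float:
    """Solve Gamma = Gamma_0 + 2 * Gamma_d for the pure dephasing rate Gamma_d."""
    return (measured_hz - limit_hz) / 2.0


# Assumption: tau is inferred from the quoted Gamma_0 = 24 MHz, not a measured value.
tau = 1.0 / (2.0 * math.pi * 24e6)               # seconds
gamma_0 = lifetime_limited_linewidth_hz(tau)     # recovers 24 MHz
gamma_d = pure_dephasing_rate_hz(37e6, gamma_0)  # from the measured 37 MHz linewidth

print(f"tau = {tau * 1e9:.2f} ns, Gamma_0 = {gamma_0 / 1e6:.1f} MHz, "
      f"Gamma_d = {gamma_d / 1e6:.1f} MHz")
```

Under these assumptions the residual broadening 2Γ_d is the 13 MHz difference between the measured and lifetime-limited linewidths.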

The PIC geometry also enables the direct detection of ZPL resonance 
fluorescence without any spectral, temporal or polarization filter- 
ing, even under resonant excitation. Figure 4b shows the resonance 
fluorescence obtained for top excitation (through (2)) and waveguide 
collection without filtering in the detection via (1). By polarizing the 
pump electric field along the waveguide axis to minimize excitation of 
the transverse electric waveguide mode, this cross-excitation/detection 
configuration achieves a ZPL intensity 18 dB above the background, 
comparable to free-space diamond entanglement experiments using 
cross-polarization and time-gated detection*. 
Fig. 5 | Controlling the optical transitions of colour centres on a PIC. a, We
applied a d.c. bias between the metal layer Au1 on diamond and metal Au2 on
the substrate to electrostatically actuate the QMC. b, SEM image of the device.
In this experiment, we investigated the optical response of emitters 1A, 1B and 2

According to finite-difference time-domain (FDTD) simulations, an
ideal emitter in the optimal configuration has a spontaneous emission
coupling efficiency of β = 0.8 into the diamond waveguide. Experimentally,
we measured this efficiency by measuring the transmission of a
laser field through a single GeV centre (Fig. 4c). By injecting a laser field
through (3) and monitoring the transmission T via (1), we observed an
extinction of 1 − T = 0.38(9) when on resonance with the GeV centre. This
extinction places lower bounds on the emitter-waveguide cooperativity,
C = 0.27(10), and on β = 0.21(6). By accounting for residual line broadening
and for the ZPL emission fraction (about 0.6), the dipole-waveguide
coupling efficiency is at least 0.55(18); see Methods for other factors
that reduce β.
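
As a quick numerical check of the lower bounds quoted above, the following sketch assumes the low-excitation relation given in the Methods, T = (1 − β)^2 = (1 + C)^-2, with no correction for line broadening or the ZPL branching ratio. The function names are illustrative only.

```python
import math


def beta_from_extinction(extinction: float) -> float:
    """Lower-bound beta from the on-resonance transmission T = (1 - beta)^2."""
    transmission = 1.0 - extinction
    return 1.0 - math.sqrt(transmission)


def cooperativity_from_beta(beta: float) -> float:
    """C from (1 - beta)^2 = (1 + C)^-2, which rearranges to C = beta / (1 - beta)."""
    return beta / (1.0 - beta)


beta = beta_from_extinction(0.38)      # measured extinction 1 - T = 0.38
coop = cooperativity_from_beta(beta)

print(f"beta = {beta:.2f}, C = {coop:.2f}")
```

With the measured extinction of 0.38, this reproduces the quoted central values of β and C to two decimal places; the uncertainties in parentheses are not propagated here.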

The excellent coherence of the GeV centre in channel 41 is not unique.
Figure 4d reports the linewidths of every channel in a characteristic
8-channel GeV diamond chiplet, all measured through the on-chip
routing of fluorescence into an optical fibre. We find a mean ± standard
deviation normalized linewidth of Γ/Γ_0 = 1.7 ± 0.7, with GeV channels 41,
45 and 48 exhibiting lifetime-limited values of 1.0(2), 0.9(1) and 1.0(2),
respectively. From these measurements, we also obtained the inhomogeneous
ZPL transition frequency distribution of 85 GHz. In these PLE
measurements, we averaged each spectrum over about 5 min (5,000
experiments), demonstrating the emitters’ long-term stability after
heterogeneous integration. Similarly, as shown in Fig. 4e, we also find
uniformly narrow lines from SiV centres across a QMC, with linewidths
within a factor of Γ/Γ_0 = 1.6 ± 0.2 from SiV centres in bulk diamond, and
with an inhomogeneous frequency distribution of 30 GHz.

To overcome the inhomogeneous spread in transition frequencies, 
we implemented a strain-tuning scheme using the electrical layers in 
our PIC. The fabricated device (Fig. 5a, b) uses a QMC that consists of 
waveguides with different lengths and beam rigidities (Extended Data 
Fig. 5). Strain is applied by a capacitive actuator consisting of one gold 
electrode (Au 1) on the QMC layer, separated transversely by 1.5 μm from
a gold ground plane (Au 2) on the sapphire substrate. A bias voltage 
deforms the waveguide so the associated strain modifies the orbital 
structures and the optical transitions of embedded colour centres®*. 
This device geometry enables tuning ranges up to 100 GHz, which is 
larger than the inhomogeneous distribution and only limited by stiction 
between the QMC and the substrate (Methods). Owing to differences in 
dipole positions and orientations, we can spectrally overlap the optical 
transitions of, for example, emitters 1A and 1B in one waveguide at a bias 
of 24.5 V, as shown in Fig. 5c. Alternatively, they can also be selectively 
aligned with that of emitter 2, initially detuned by about 10 GHz in 
another waveguide channel, at distinct voltages. During strain tuning, 
we did not observe degradation in the linewidths in PLE scans lasting 
3 min (Extended Data Fig. 8); this long-term stability remained within 
150 MHz over 3 h of continuous measurement without feedback and 
unchanged up to a tuning of about 6.8 GHz (Extended Data Fig. 9). 

Although not demonstrated here, an array of electrodes could
provide closed-loop tuning on each waveguide-coupled emitter
to generate indistinguishable photons for Hong-Ou-Mandel
interference using on-chip beamsplitters. On the basis of the emitter
linewidths measured here over minutes without feedback, we estimate
high-visibility interference of 0.9 (0.8) for stable GeV (SiV) emitters such
as those in channel 41 (69) and a visibility of 0.58 ± 0.24 (0.63 ± 0.07)
when averaging over all emitters in Fig. 4d, e. The optical coherence and
photon indistinguishability, which are critical for entangling operations,
can be improved, for example, through the Purcell effect by coupling
to photonic cavities.
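
The text does not state how the visibilities were computed. As a rough cross-check only, the sketch below assumes the common pure-dephasing estimate V = Γ_0/Γ (equivalently T_2/(2T_1)) for two identical emitters; this is an assumption and not necessarily the authors' method, and the function name is hypothetical.

```python
def hom_visibility(normalized_linewidth: float) -> float:
    """Estimate V = Gamma_0 / Gamma for a linewidth given in units of Gamma_0."""
    if normalized_linewidth <= 0:
        raise ValueError("normalized linewidth must be positive")
    # A ratio below 1 would be narrower than the lifetime limit, so cap V at 1.
    return min(1.0, 1.0 / normalized_linewidth)


# Mean normalized linewidths reported for the GeV and SiV chiplets.
gev_mean = hom_visibility(1.7)
siv_mean = hom_visibility(1.6)

print(f"GeV: V ~ {gev_mean:.2f}, SiV: V ~ {siv_mean:.2f}")
```

Evaluated at the mean normalized linewidths, this simple estimate lands close to the averaged visibilities quoted above, but it ignores the spread between emitters and any spectral diffusion, so it should be read as an order-of-magnitude consistency check.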

The large-scale integration of artificial atoms with photonics
extends to a wide range of nanophotonic devices, in particular,
high-quality-factor diamond photonic crystal cavities and other
optically active spins such as NV centres, emerging diamond
group-IV quantum memories, quantum dots and rare-earth ion
dopants. The advances reported in this work should therefore
encourage further integration of photonic and electronic components
for large-scale quantum-information-processing applications such as
multiplexed quantum repeaters or modular quantum computers based
on solid-state spins. Key components have already been individually
demonstrated, including photonic switch arrays and beamsplitter
meshes for reconfigurable qubit connectivity and heralded spin
entanglement, AlN-based high-speed electro-optic modulators and
SNSPDs, and custom complementary metal-oxide-semiconductor
electronics for colour centre spin control and low-latency processing.
As PIC applications ranging from optical communications to
phased-array light detection and ranging to machine learning
accelerators are pushing systems beyond thousands of optical components,
the high-yield integration with arrays of high-quality artificial atoms
provides a basis to extend these scaling gains to quantum information
processing with spins and photons.
Methods


Ion implantation

Extended Data Fig. 1 summarizes the fabrication and integration
processes. First, we relieved the strained surface of the single-crystal
diamond plate (Element6) by plasma etching the first 10 μm of diamond
in Ar/Cl2, followed by another 5 μm etching in pure oxygen plasma. We
used an FIB tool at the Ion Beam Laboratory (Sandia National
Laboratories) to implant Ge ions (spot size of about 35 nm × 43 nm) and Si
ions (spot size of about 50 nm x 45 nm) at an effective areal dose of 2 x 
10"-6 x10" ions per cm’ and 4.5 x 10"-9 x 10" ions per cm’, respectively. 
The Ge (Si) ion energy is 200 keV (170 keV), which corresponds to an
implantation depth of 74 ± 12 nm (113 ± 19 nm) from stopping and range
of ions in matter (SRIM) simulations. After implantation, we annealed
the devices at 1,200 °C in an ultrahigh vacuum furnace. Finally, we
cleaned the diamond in a boiling mixture of 1:1:1 sulfuric acid, nitric
acid and perchloric acid.


Conversion yield of GeV and SiV centres 

We analysed the conversion yields of GeV and SiV centres by counting
the absence of fluorescent spots in our implantation region (1-μm
pitch, square grid) using photoluminescence microscopy. A Poisson
distribution P(k), with mean number of colour centres λ and number of
observed emitters per spot k, models the stochastic emitter creation
process. From the mean λ = −log(P(0)) and our implantation dose, we
estimate the conversion yield of GeV (SiV) centres to be about 1.9%
(3.2%).
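
The Poisson estimate described above can be sketched as follows. This is a minimal illustration, assuming the conversion yield is the mean number of emitters per spot divided by the number of implanted ions per spot; the input values (fraction of dark spots, ions per spot) are hypothetical placeholders, not the experimental numbers.

```python
import math


def mean_emitters_per_spot(empty_fraction: float) -> float:
    """Poisson mean from the fraction of dark spots: P(0) = exp(-lambda)."""
    if not 0.0 < empty_fraction <= 1.0:
        raise ValueError("empty_fraction must lie in (0, 1]")
    return -math.log(empty_fraction)


def conversion_yield(empty_fraction: float, ions_per_spot: float) -> float:
    """Fraction of implanted ions converted into colour centres."""
    return mean_emitters_per_spot(empty_fraction) / ions_per_spot


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of observing exactly k emitters in a spot."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


# Hypothetical inputs for illustration only (not values from the experiment).
empty_fraction = 0.15   # 15% of implanted spots show no fluorescence
ions_per_spot = 100.0   # implanted ions per spot, from dose times spot area

lam = mean_emitters_per_spot(empty_fraction)
yield_estimate = conversion_yield(empty_fraction, ions_per_spot)
print(f"lambda = {lam:.3f}, yield = {100 * yield_estimate:.2f}%, "
      f"P(exactly one emitter) = {poisson_pmf(1, lam):.3f}")
```

Counting dark spots rather than bright ones is convenient because P(0) depends only on λ and does not require resolving how many emitters sit in each bright spot.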


Registration of emitters using optical localization 

We located and mapped the fabricated quantum emitters relative to
prefabricated alignment markers using a wide-field and confocal scanning
microscope as shown previously. To demonstrate the microchiplet
principle in this study, we registered the qubit grid, rather than each
emitter’s location. In particular, we determined the global displacement
of the emitter grid from the implantation process and used this offset
in our subsequent electron-beam lithography of QMCs. We anticipate
that the targeted placement of devices over pre-localized emitters
will increase the yield of emitter-coupled QMCs. Such approaches may
also be critical for reducing the proximity of nanofabricated surfaces
to emitters and for coupling to photonic crystal nanocavities.


Quantum microchiplet fabrication 

After ion implantation and optical registration, we used a quasi-isotropic
diamond etching recipe to fabricate suspended QMCs. In particular,
we deposited 180 nm of silicon nitride (SiN) hard mask using
plasma-enhanced chemical vapour deposition. We patterned the SiN
hard mask using a ZEP-520A electron-beam resist with ESpacer conductive
polymer and CF4 reactive-ion etching. Subsequently, we used inductively
coupled reactive-ion etching to transfer the pattern from SiN
into the diamond layer. Following oxygen etching of the diamond, we 
deposited 15 nm of conformal alumina via atomic layer deposition. After 
a brief breakthrough etch of alumina, we etched the chip in zero-bias 
oxygen plasma to isotropically undercut the diamond QMCs. Finally, 
we removed the SiN and alumina masks in hydrofluoric acid. We again 
annealed the device at 1,200 °C using the above ultrahigh-vacuum, 
high-temperature annealing recipe, followed by a clean in a boiling 
mixture of 1:1:1 sulfuric acid, nitric acid and perchloric acid. 


AlN photonics

AlN is a large-bandgap material (about 6.2 eV) that is suitable for linear
optics, nonlinear optics and optomechanics, with an electro-optic
coefficient of r_33 = 1 pm V^-1, second-order optical nonlinear
susceptibility of χ^(2) = 4.7 pm V^-1, a third-order optical nonlinearity
(Kerr) coefficient of n_2 = 2.5 × 10^-15 cm^2 W^-1 and a piezoelectric
coefficient of d_33 ≈ 5 pm V^-1. Here we used a wafer of 200-nm-thick


single-crystal AlN on a sapphire substrate (MSE Supplies, grown by
hydride vapour phase epitaxy). Before processing of the AlN PIC, we
patterned gold alignment markers to use for alignment between the
photonic layer and the metal layers for strain tuning. We defined the
AlN photonic circuitry using EBL (ZEP-520A electron-beam resist and
ESpacer conductive polymer) and chlorine-based inductively coupled
plasma reactive-ion etching. Then, S1813 photoresist served as a
protective layer for mechanical edge polishing. We then diced the chip
using an automatic dicing saw (DISCO DAD-3240). We polished the
chip to produce optical-grade facets for edge coupling (Allied
MultiPrep Polishing System 8). Finally, sonication in N-methyl-2-pyrrolidone
removed the S1813 protective layer and debris caused by dicing and
mechanical polishing.


Metal layers 

The fabrication of the metal electrodes and contact pads on top of the
PIC substrate immediately followed the patterning of the thin-film AlN
and preceded the chip dicing and edge polishing. The PIC substrate
metal layer was defined by lift-off of 50-nm Au on top of 5-nm Ti using a
single layer of A6 950K poly(methyl methacrylate) electron-beam resist
(450 nm thick), which was aligned relative to the AlN PIC with metal
alignment markers. Then, the fabrication of the AlN photonic circuitry
proceeded to dicing and polishing, followed by integration of the QMC.
After pick-and-place transfer of the QMC to the microchiplet socket,
we used a targeted electron-beam metal deposition process to place
platinum on the periphery of the QMC for electrical connection (FEI
Helios NanoLab 600 DualBeam). This process also locked the QMC into
place before resist spin-coating. Finally, we defined the metal electrode
layer on top of the QMC by lift-off of 15-nm Au on 5-nm Ti using a single
layer of A11 950K poly(methyl methacrylate) (2 μm thick).


Yield of defect-free microchiplets 

Using photoluminescence spectroscopy, we investigated the occur- 
rence of defect-free 8-channel QMCs, as summarized in Extended Data 
Fig. 2. From this histogram, we estimate the probability of creating 
defect-free QMCs to be 39%. We note that this success probability 
depends on a variety of factors, including the alignment accuracy of
the FIB implantation, the relative calibration between EBL and FIB, as 
well as the optical registration process. By deterministically placing 
each element of the QMC over pre-localized emitters, it should be pos- 
sible to boost the yield towards unity, allowing hundreds or thousands 
of quantum channels per chiplet. 
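
To illustrate why the defect-free yield falls with channel count, the sketch below assumes that each waveguide independently contains a stable emitter with the same probability p. This independence assumption is ours, not the authors'; the misalignment described in the text is likely correlated across a chiplet, so the extrapolation is indicative only.

```python
def per_channel_success(defect_free_yield: float, channels: int) -> float:
    """Per-waveguide success probability p, assuming yield = p ** channels."""
    return defect_free_yield ** (1.0 / channels)


def defect_free_yield(p: float, channels: int) -> float:
    """Probability that every one of `channels` waveguides hosts an emitter."""
    return p ** channels


# Measured: 39% of 8-channel QMCs are defect free.
p = per_channel_success(0.39, 8)

# Extrapolations under the independence assumption (not measured values).
yield_16 = defect_free_yield(p, 16)
yield_64 = defect_free_yield(p, 64)

print(f"p = {p:.3f}, 16-channel = {yield_16:.3f}, 64-channel = {yield_64:.5f}")
```

Under this model the 16-channel yield is the square of the 8-channel yield, which is consistent with the lower 16-channel yield noted in the main text and shows why per-channel success must approach unity before chiplets with hundreds of channels become practical.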


Pick-and-place transfer process 

We used piezo-controlled three-axis and rotation stages to align the 
QMC with the PIC. In addition to the AlN waveguides, the QMC also
rests on top of multiple small AlN pedestals to prevent bowing of the
diamond structures and stiction with the underlying sapphire substrate.
In the case of an inaccurate placement, both the QMC and socket can 
be reused simply by picking the QMC and re-attempting the placement 
process. Experimentally, we have transferred a variety of arrays, ranging 
from single-channel devices all the way to 64-channel QMCs. We expect 
computer-controlled placement and self-alignment locking features to 
improve the transfer rate and to potentially fully automate the process. 


Experimental setup 

We used a closed-cycle helium cryostat with a base temperature of 4 K
(Montana Instruments) with a top-access microscope objective
(Mitutoyo 100x ULWD, numerical aperture (NA) of 0.55). We used three-axis
nanoposition steppers (Attocube ANP-x,z-50) and scanners (Attocube
ANS-x,z-50) for edge coupling of optical fibres (lensed fibre with a
spot size of 0.8 μm at 633 nm or a Nufern UHNA3 fibre) to the PIC. For
photoluminescence (PLE) spectroscopy, we filtered the fibre-coupled
fluorescence in free space using bandpass filters: Semrock
FF01-605/15 (FF01-647/57) for GeV centres and FF01-740/13 (FF01-775/46)
for SiV centres. We off-resonantly pumped GeV (SiV) using 532-nm
(660-nm) lasers. Resonant excitation was achieved using a tunable laser
(MSquared SolsTiS with an external mixing module). For PLE, we used
acousto-optic modulators to excite GeV centres with a resonant pulse
and an optional 532-nm charge repump pulse. For SiV centre
experiments, we did not gate the resonant and repump optical pulses. In the
resonance fluorescence detection experiment (Fig. 4b), we placed a
half-wave plate before channel 2 to minimize laser coupling into the
waveguide mode. To measure the excited-state lifetime of single
emitters, we used time-correlated single-photon counting (PicoHarp 300)
and a pulsed laser source (SuperK, filtered to 532 ± 20 nm). We fitted
the lifetime curves of the emitters in Fig. 4d with biexponential terms
to account for fast decay and the slower fluorescence decay time
constant. For strain-tuning experiments, we used a programmable voltage
source (Keithley 2400) and observed negligible leakage currents (less
than 0.2 nA) for all applied voltages in this experiment (up to 35 V).


Spontaneous emission β-factor and dipole coupling with the
waveguide mode

The extinction in the resonant transmission spectrum arises from the
interference between the scattered and incoming optical fields, and
its depth depends on the dipole-waveguide coupling
$\beta = \Gamma_{\mathrm{wg}}/(\Gamma_{\mathrm{wg}} + \Gamma')$,
where $\Gamma_{\mathrm{wg}}$ is the emission rate into the waveguide mode and
$\Gamma'$ the decay rate into all other channels. For the measurement in
Fig. 4c, we first characterized the saturation response of the emitter
when excited via (3). At the low-excitation limit, the cooperativity C
can be extracted from $T = (1 - \beta)^2 = (1 + C)^{-2}$. By also
accounting for line broadening of
$2\Gamma_{\mathrm{d}}/\Gamma_0 = 0.33(14)$, we determined β via
$T = 1 - \frac{\beta(2 - \beta)}{(1 + 2\Gamma_{\mathrm{d}}/\Gamma_0)(1 + S)}$,
which reduces to the usual expression $T \approx (1 - \beta)^2$ in the
absence of broadening and far from saturation, $S \ll 1$. In this
experiment, we operated at $S \sim 10^{-1}$ and all errors denote the fit
or propagated uncertainties. The
discrepancy of the experimental β = 0.21(6) (0.55(18) after correcting
for broadening and a ZPL branching ratio of 0.6) with the simulated
β = 0.8 using the three-dimensional (3D) FDTD method (Lumerical)
arises from three possible sources: (1) angular and positional
misalignment of the dipole in the waveguide; (2) a finite population in the
upper ground state and emission into transition D; and (3) possible
non-radiative processes.
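
As a numerical reading aid for the low-excitation relation quoted above, the following Python sketch inverts T = (1 − β)² for β and converts β into a cooperativity through C = β/(1 − β). It deliberately ignores line broadening and saturation, so it applies only in that idealized limit; the function names are illustrative and do not come from the original work.

```python
import math


def transmission_from_beta(beta):
    """Resonant transmission T = (1 - beta)**2 (no broadening, S << 1)."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return (1.0 - beta) ** 2


def beta_from_transmission(transmission):
    """Invert T = (1 - beta)**2 for the dipole-waveguide coupling beta."""
    if not 0.0 <= transmission <= 1.0:
        raise ValueError("transmission must lie in [0, 1]")
    return 1.0 - math.sqrt(transmission)


def cooperativity_from_beta(beta):
    """From (1 - beta)**2 = (1 + C)**-2 it follows that C = beta / (1 - beta)."""
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must lie in [0, 1)")
    return beta / (1.0 - beta)


if __name__ == "__main__":
    beta = 0.21  # uncorrected experimental value quoted in the text
    T = transmission_from_beta(beta)
    print(f"beta = {beta:.2f} -> T = {T:.3f}, extinction = {1 - T:.3f}")
    print(f"cooperativity C = {cooperativity_from_beta(beta):.3f}")
```

In this idealized limit, β = 0.21 corresponds to an on-resonance extinction of a little under 40%.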


Diamond-PIC coupling 

Extended Data Fig. 3a, b shows the normalized electric field |E| of
602-nm (737-nm)-wavelength transverse electric light coupling from
the diamond waveguide (340 nm x 200 nm) to the bottom AlN waveguide
(800 nm x 200 nm), calculated using the 3D FDTD method.
The light transfers adiabatically via tapered sections in the diamond
waveguide and AlN waveguide. Here the coupling region is 9 μm long,
with a diamond taper length of 8 μm and AlN taper length of 5 μm.
The top insets show 2D transverse cross-sections of the light
propagation. The cross-sections at y = −10 μm and y = 1 μm correspond to
the fundamental transverse electric mode of the diamond waveguide
and AlN-on-sapphire waveguide, respectively. The cross-section at
y = −5 μm (y = −6 μm) is the point where half of the light launched from
the diamond waveguide is transferred to the AlN waveguide at 602-nm
(737-nm) wavelength. The light from the diamond waveguide couples to
the AlN waveguide with 97% (98%) efficiency at these wavelengths, with
all of the light coupling preferentially to the AlN fundamental transverse
electric mode and negligible coupling to higher-order modes. This
optimized device geometry was determined by optimizing for the
coupling efficiency from the fundamental transverse electric mode of
the diamond waveguide to the fundamental transverse electric mode of
the AlN while sweeping the diamond taper length, the AlN taper length,
and the overlap region between the diamond and AlN waveguides. In
Fig. 2g, we showed a typical transverse placement error of 38 ± 16 nm
for our transfer placement of the QMC to the microchiplet socket; in
simulation, this displacement corresponds to a decrease of the coupling
efficiency to 93% (89%) at 602-nm (737-nm) wavelength. Hence, we
have a 0.46-dB tolerance in the coupling efficiency within our transfer
placement accuracy. By directly measuring the PIC-diamond-PIC
transmission efficiency, we found the interlayer coupling efficiency to
be greater than 34%, which was lower than simulations probably due
to scattering at the interfaces and the QMC cross-junctions.


PIC-fibre coupling 

We couple laser and photoluminescence to and from AlN-on-sapphire
waveguides using lensed fibres (Nanonics Imaging, SM-630 with
spot size 0.8 ± 0.3 μm and working distance 4 ± 1 μm) for cryostat
experiments and ultrahigh NA fibres (UHNA3) for room-temperature
experiments. Under our single-mode operation at 602-737 nm, the
in-coupling efficiency is the same as the out-coupling efficiency of
AlN waveguide to lensed fibre, which we find to be 51-57% using the
3D FDTD method. In practice, the PIC-fibre coupling efficiency, which
we find to be about 11% in our devices, is sensitive to the edge coupler
polishing quality. For the high-NA fibre, which is multimode at our
wavelengths of interest, we find the numerical out-coupling efficiency
to the fundamental fibre mode to be 25% (34%) at 602 nm (737 nm);
there is also 1% (3%) coupling into higher-order modes.
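
The simulated efficiencies above can also be read as insertion losses in decibels. The short Python sketch below performs that conversion with the standard relation loss = −10 log10(efficiency). The dictionary labels are illustrative; because the text does not say which end of the 51-57% range belongs to which wavelength, the two values are labelled only as lower and upper bounds.

```python
import math


def insertion_loss_db(efficiency):
    """Insertion loss in dB for a power coupling efficiency in (0, 1]."""
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must lie in (0, 1]")
    return -10.0 * math.log10(efficiency)


# Simulated fibre coupling efficiencies quoted in the text.
SIMULATED_EFFICIENCIES = {
    "lensed fibre, lower bound": 0.51,
    "lensed fibre, upper bound": 0.57,
    "UHNA3 fundamental mode, 602 nm": 0.25,
    "UHNA3 fundamental mode, 737 nm": 0.34,
}

if __name__ == "__main__":
    for label, eta in SIMULATED_EFFICIENCIES.items():
        print(f"{label}: {eta:.0%} -> {insertion_loss_db(eta):.2f} dB")
```

The same helper applies to the measured value of about 11%, which corresponds to a loss well above the simulated figures.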


System efficiency η

Extended Data Fig. 4a shows the response from an idealized emitter
system, fitted to $F = F_{\mathrm{sat}} P/(P_{\mathrm{sat}} + P) + cP$, where P is the
continuous-wave 532-nm excitation pump power, cP is the linear
background, $P_{\mathrm{sat}} = 1.2$ mW is the saturation power, F is the measured
ZPL fluorescence at the detector and $F_{\mathrm{sat}} = 1.11$ megacounts per
second (Mcps). Extended Data Table 1 shows an average saturated count
rate of 0.64 ± 0.36 Mcps from an array of GeV waveguides in a QMC. To
independently measure the system efficiency at the detector,
$\eta_{\mathrm{system}}$, we used a pulsed source (SuperK Extreme, 532 ± 20 nm)
with a repetition rate of 26 MHz. From the saturation response
(Extended Data Fig. 4b), we determined $F_{\mathrm{sat}} = 0.25$ Mcps and
$\eta_{\mathrm{system}} = 0.72\%$. This experimentally determined efficiency is
within a factor of five from the independently calculated
$\eta_{\mathrm{system}} = 0.5\,\beta\,\eta_{\mathrm{PIC}}\,\eta_{\mathrm{fibre}}\,\eta_{\mathrm{setup}} \approx 2.6\%$,
where $(\beta, \eta_{\mathrm{PIC}}, \eta_{\mathrm{fibre}}, \eta_{\mathrm{setup}}) \approx (0.55, 0.34, 0.33, 0.58)$
are the dipole-waveguide, diamond-PIC coupling, PIC-fibre coupling and
external setup detection efficiencies, respectively. Here the factor of
0.5 accounts for the present configuration in which we collected the
photon emission in one direction only. In these saturation experiments
at room temperature, we used a lensed fibre with 2.5-μm spot size at
1,550 nm, which we find to have $\eta_{\mathrm{fibre}} = 33\%$. We attribute the
discrepancy to the non-unity radiative quantum efficiency of the emitter
and deviations in β, $\eta_{\mathrm{PIC}}$ and $\eta_{\mathrm{fibre}}$ from independent
measurements based on another device. In the next subsection, we outline
methods to improve the system efficiency.
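
To make the saturation model used for the fit concrete, the following Python sketch evaluates F = F_sat P/(P_sat + P) + cP with the continuous-wave parameters quoted above (F_sat = 1.11 Mcps, P_sat = 1.2 mW). The background slope c is not given in the text, so it is set to zero here as an explicit assumption; the function name is illustrative.

```python
def saturation_counts(power_mw, f_sat_mcps, p_sat_mw, background_slope=0.0):
    """Count rate F = F_sat * P / (P_sat + P) + c * P, in Mcps.

    background_slope is the linear background c in Mcps per mW. Its value
    is not quoted in the text, so the default of zero is an assumption.
    """
    if power_mw < 0 or p_sat_mw <= 0:
        raise ValueError("power must be >= 0 and saturation power > 0")
    saturable = f_sat_mcps * power_mw / (p_sat_mw + power_mw)
    return saturable + background_slope * power_mw


if __name__ == "__main__":
    F_SAT = 1.11  # Mcps, continuous-wave fit value from the text
    P_SAT = 1.2   # mW, continuous-wave fit value from the text
    for power in (0.1, 0.6, 1.2, 2.4, 12.0):
        rate = saturation_counts(power, F_SAT, P_SAT)
        print(f"P = {power:5.1f} mW -> F = {rate:.3f} Mcps")
```

At P = P_sat the saturable term equals exactly half of F_sat, which is a convenient check when reading a saturation curve.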


Improving the system efficiency 

In our experiments, the uncladded microchip enables the heterogeneous
integration of QMC but the mode mismatch between the
AlN-on-sapphire waveguide and the lensed (high NA) fibre causes at
least 3 dB (5 dB) insertion losses as characterized above. It is possible
to improve the mode overlap by cladding the microchip with alumina
or with materials with similar refractive indices as the underlying
sapphire. In such a scheme, we would taper down the AlN waveguides at
the chip facet to better mode-match with the lensed fibre. Our cladded
edge coupler design can substantially improve the coupling from the
AlN waveguide to the lensed fibre to be 84% (84%) at 602 nm (737 nm), as
well as improve the coupling to high NA fibre to be 89% (91%) at 602 nm
(737 nm), using the present AlN-on-sapphire material and film
thickness. In this design, we matched the mode field diameters and reduced
the effective refractive index mismatch between the fundamental
transverse electric modes at AlN edge coupler facet and the lensed (high
NA) fibre focus spot (facet). Owing to the index mismatch, the light
coupling is limited by Fresnel reflections at the waveguide facet, which
can be reduced using an index-matching environment. Finally, on-chip
reflectors in diamond can increase the photon collection efficiency by
a factor of two, and photonic crystal cavities can boost the emission
into the waveguide mode.


Strain tuning scheme of QMC on PIC 

We introduced different optical responses to our emitter QMC by
changing the length of their constituent waveguides. Here we used
waveguides of length 20 μm (type I) and length 15 μm (type II). To be
compatible with the QMC framework, we included a flexible bridge
between type II waveguides and the QMC body (Extended Data Fig. 5a,
Fig. 5a). Extended Data Fig. 5b confirms the difference in strain response
at 30 V (modelled using COMSOL Multiphysics) between type I and
type II waveguides.


Response of optical transitions to strain 

We consider single GeV centres (emitter 1A, emitter 1B, emitter 2)
indicated in Extended Data Fig. 5a, Fig. 5a, b. Extended Data Fig. 6 plots the
spectral response of the optical transition lines up to an applied voltage
of 30 V. From the increasing line splitting of the orbital ground states
$\Delta_{\mathrm{g}}$, that is, between lines C and D (as well as A and B), we
find that emitter 1B is a dipole whose axis lies in the transverse plane
of the waveguide. On the basis of the unidirectional shift of all four
lines, emitters 1A and 2 are dipoles oriented in the longitudinal
cross-sectional plane of the waveguide. In particular, the global
blueshift of the lines of emitter 1A indicates that it resides in a region
with compressive strain (that is, below the neutral axis of the
mechanical beam). Conversely, the optical lines of emitter 2 redshift with
applied voltage, indicating that it resides in a region with tensile
strain, which is located above the neutral axis of the waveguide.
Extended Data Fig. 7 shows the robustness of the strain-tuning mechanism
as we repeatedly applied voltages from 10 V to 26 V. Above 30 V, we see
over 100 GHz of tuning of the two brightest transitions C and D for
emitters 1A and 2; however, we note that in this regime there was
hysteresis, possibly due to stiction with the underlying gold and
substrate about 150 nm and 200 nm away, respectively. Nevertheless, for
the purpose here, we were able to spectrally overlap any pair of the
three emitters with less than 25 V. Revised electrode, QMC and/or
PIC designs in future microchips should be able to extend the spectral
shift of individually tunable waveguides. We note that the small ‘pull
in’ voltage in our experiment appears earlier than it does in simulation
(over 250 V), possibly due to the surface conductivity of diamond.


Stability of optical transitions 

We investigated the optical stability of an emitter during spectral tuning 
via strain. Here we monitored a GeV in another chiplet with an identical 
electrode configuration (the earlier device used for Fig. 5 and Extended 
Data Figs. 5-7 was no longer available due to an accident). Extended Data 
Fig. 8 shows the centre frequency shift (bottom) of the ZPL transition 
and its linewidth (top) as a function of voltage. In these PLE linescans 
under strain, we averaged each spectrum over 2,000 experiments (about 
3 min) and did not observe substantial degradation in the linewidth. 
We then tracked the ZPL at various voltage biases under repeated PLE 
measurements over 3 h. Extended Data Fig. 9 shows the long-term 
ZPL stability to within 150 MHz for spectral tuning up to 6.8 GHz. At 
higher tuning ranges, the linewidths were unchanged but there was an 
increase in spectral diffusion of the centre frequency, probably due to
an induced permanent dipole moment that increased susceptibility to
charge fluctuations. Nevertheless, at a tuning of 20 GHz, the FWHM
of the inhomogeneous distribution remained under 250 MHz, which is
within a factor of four of the initial linewidths of about 60 MHz.


Data availability 


The datasets generated during and/or analysed during the current study 
are available from the corresponding author on reasonable request. The 
data that support the findings of this study are also openly available in 
figshare at https://doi.org/10.6084/m9.figshare.11874291. 
Extended Data Fig. 1 | Flowchart for large-scale heterogeneous integration. See main text and methods for process descriptions.

Extended Data Fig. 2 | Histogram of number of emitter-coupled waveguides within a QMC. The red coloured bar corresponds to the defect-free 8-channel QMCs that were suitable for integration. The orange coloured bars correspond to the QMCs that we did not use in this work.

Extended Data Fig. 3 | FDTD simulation showing propagation of light from the diamond waveguide into the AlN waveguide. a, For a 602-nm wavelength (corresponding to the GeV colour centre ZPL). b, For a 737-nm wavelength (corresponding to the SiV colour centre ZPL).

Extended Data Fig. 4 | Saturation response of a single GeV centre. a, Continuous-wave 532-nm laser excitation. b, Pulsed laser excitation at 532 nm with a repetition rate of 26 MHz.

Extended Data Fig. 5 | Strain distribution along the waveguides and emitters considered in the main text.

Extended Data Fig. 6 | Spectral shift of GeV centres in response to strain fields. a-c, Strain response of emitter 1A (a), emitter 1B (b) and emitter 2 (c).

Extended Data Fig. 7 | Spectral shifts for the brightest transitions. Reproducible spectral shifts between 10 V and 26 V for the two brightest transitions C and D for emitter 2.

Extended Data Fig. 8 | Optical properties during strain tuning. Top: PLE linewidths as a function of voltage. Bottom: corresponding frequency shift, Δν, of the ZPL transition.

Extended Data Fig. 9 | Stability of the ZPL transition frequency during strain tuning. Each time slice corresponds to a single PLE linewidth measurement averaged over 2,000 experiments (about 3 min).

Extended Data Table 1 | Saturated count rates from single GeV centres in a QMC
Active optical control over matter is desirable in many scientific disciplines, with
prominent examples in all-optical magnetic switching, light-induced metastable or
exotic phases of solids and the coherent control of chemical reactions. Typically,
these approaches dynamically steer a system towards states or reaction products far
from equilibrium. In solids, metal-to-insulator transitions are an important target for
optical manipulation, offering ultrafast changes of the electronic and lattice
(refs 11-16) properties. The impact of coherences on the efficiencies and thresholds
of such transitions, however, remains a largely open subject. Here, we demonstrate
coherent control over a metal-insulator structural phase transition in a
quasi-one-dimensional solid-state surface system. A femtosecond double-pulse
excitation scheme (refs 17-20) is used to switch the system from the insulating to a
metastable metallic state, and the corresponding structural changes are monitored by
ultrafast low-energy electron diffraction. To govern the transition, we harness
vibrational coherence in key structural modes connecting both phases, and observe
delay-dependent oscillations in the double-pulse switching efficiency. Mode-selective
coherent control of solids and surfaces could open new routes to switching chemical
and physical functionalities, enabled by metastable and non-equilibrium states.


Femtochemistry entails the search for understanding and control of
ultrafast reaction pathways. To this end, coherences in the electronic
and vibrational states of reactants are used to affect transitions in a
complex, generally multidimensional energy landscape. Established
for small molecules, the possible transfer of this concept to extended
systems and solids is complicated by, for example, the high electronic
and vibrational density of states, and by couplings to an external heat
bath. Low-dimensional and strongly correlated systems represent a
promising intermediate between molecules and solids, with phase
transitions assuming the role of a ‘reaction’. Some of these transitions
can be driven optically by means of transient heating, electronic
excitation or direct resonant coupling to certain vibrational
degrees of freedom.

The prototypical case of a phase transition governed by structural
modes is given by the Peierls instability, in which a metal-to-insulator
transition is linked to phonon softening and the appearance of a static
periodic lattice distortion. Coherent oscillations of the periodic lattice
distortion, known as amplitude modes or amplitudons, are frequently
observed in the optical pumping of such transitions, especially close
to their thresholds. In analogy to the vibrational spectroscopy
of reacting molecules, amplitudons can be used to track ultrafast
changes in the lattice symmetry across a phase transition. However,
it remains to be shown how coherent amplitude motion can be
used to manipulate the outcome of a structural transition.

Here we report coherent control over the phase transition in a
quasi-one-dimensional Peierls insulator by means of the amplitudes
of specific phonon modes. We use a double-pulse excitation scheme
and monitor the structural transformation by ultrafast low-energy
electron diffraction (ULEED; Fig. 1a, see Methods). Observing the
resulting structure as a function of the double-pulse separation
demonstrates the importance of shear and rotational phonon modes on
the femtosecond timescale. A comparison of ULEED and transient
reflectivity measurements suggests distinct roles of these phonons in
controlling the transition, and points to the location of the transition
state along the mode coordinates.

As a model system, we study atomic indium wires on the (111) surface
of silicon, a prominent Peierls system attracting interest for its
ultrafast dynamics. Arranged in a zigzag pattern, the indium atoms
induce a metallic (4 x 1) superstructure, which, at critical temperature
$T_{\mathrm{c}} = 125$ K, exhibits a first-order transition to an insulating
state with quadrupled (8 x 2) unit cell size and a hexagon-shaped indium
pattern. The associated change in atomic structure causes additional
spots in backscattering diffraction (see LEED patterns in Fig. 1b).
Below $T_{\mathrm{c}}$, a single optical pump pulse is able to electronically
excite the system to a metastable (4 x 1) state. Time-resolved
diffraction and photoemission spectroscopy have recently revealed the
ultrafast and ballistic nature of this transition (occurring on a 350-fs
timescale) and identified excited electrons and localized photoholes as
its driving force.

Tracking the (4 x 1) and (8 x 2) diffraction spot intensities in ULEED, 
we observe a rapid increase/decrease directly after optical excitation 
and subsequent relaxation to a level persisting over nanoseconds 
(Fig. 1c, left), indicating the metastability of the structure.
Interestingly, this long-lived contribution displays a rather gradual
threshold in pump fluence. This implies that for intermediate excitation
Fig. 1 | Ultrafast LEED set-up and structural phase transition in atomic
indium wires on silicon. a, Experimental scheme. Ultrashort electron pulses
from a miniaturized laser-driven electron gun are used in a LEED experiment to
monitor the microscopic structure of atomic indium wires on the Si(111)
surface after optical excitation by single or double pulses. White, silicon atoms;
violet, indium. b, Selected regions and line profiles from LEED patterns of the
metallic (4 x 1) and insulating (8 x 2) phases (white frame in a). The emergence of


densities, a variable part of the surface is switched to the metastable
state (Fig. 1c, right), despite homogeneous excitation of the probed
area (see Methods). An interpretation based on the coexistence of both
phases is also corroborated by scanning tunnelling microscopy and
Raman spectroscopy well below $T_{\mathrm{c}}$.

It seems likely that near the threshold, the structural transition
is particularly susceptible to weak perturbations, affecting the
efficiency of driving the system to the metastable state. Motivated by
control schemes in femtochemistry, we explore the use of pulse
sequences to manipulate the switching efficiency. Specifically, we use
a pair of optical pump pulses with variable delay $\Delta t_{\mathrm{p-p}}$ and
probe the resulting structure by ULEED at a later time of
$\Delta t_{\mathrm{p-el}} = 75$ ps, well after the excitation. We find that the
signature of the metastable state (that is, a suppression of the (8 x 2)
and increase of the (4 x 1) phase) is a strong function of the
double-pulse delay $\Delta t_{\mathrm{p-p}}$ (Fig. 2a). Importantly,
at intermediate fluences between 0.5 mJ cm⁻² and 1.4 mJ cm⁻²,
pronounced oscillations with a period of 1-2 ps are observed on either
delay side, with opposing behaviour for the (4 x 1) (top panel) and
(8 x 2) spots. By contrast, only a minor delay-dependence is found
well below and above threshold. The peaked signal around
$\Delta t_{\mathrm{p-p}} = 0$ is attributed to additive electronic excitation
which decays on a few-picosecond timescale.

The observed oscillations clearly demonstrate a coherent response 
of the signal. In particular, coherent vibrational motion induced by the 
first pulse controls the switching efficiency for the second pulse. The 
frequency content of the signal (Fig. 2b, top) points to shear and
rotation phonon modes, which have previously been identified as amplitude
modes of the metal-insulator phase transition (see Fig. 2c).
Interestingly, Fourier-filtered traces of the two observed frequency bands
exhibit opposite phases at time zero, corresponding to enhancement 
or suppression of the transition for pulse overlap (see Fig. 2b, bottom). 
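
The kind of frequency analysis described above can be sketched in a few lines of Python. The example builds a purely synthetic delay trace (not measured data) from two cosines with opposite sign at zero delay, using the ULEED mode frequencies of 0.57 THz and 0.82 THz reported later in the text, and then locates the two strongest spectral peaks with a direct Fourier sum. The sampling step, trace length and equal amplitudes are illustrative assumptions.

```python
import cmath
import math


def synthetic_trace(times_ps, f_shear_thz=0.57, f_rot_thz=0.82):
    """Synthetic (not measured) trace: two modes with opposite sign at zero delay."""
    return [
        math.cos(2 * math.pi * f_shear_thz * t) - math.cos(2 * math.pi * f_rot_thz * t)
        for t in times_ps
    ]


def spectrum(times_ps, signal, freqs_thz):
    """Magnitude of the Fourier sum evaluated on an arbitrary frequency grid."""
    return [
        abs(sum(s * cmath.exp(-2j * math.pi * f * t) for t, s in zip(times_ps, signal)))
        for f in freqs_thz
    ]


def two_strongest_peaks(freqs_thz, magnitudes):
    """Frequencies of the two highest local maxima, in ascending order."""
    peaks = [
        (magnitudes[i], freqs_thz[i])
        for i in range(1, len(magnitudes) - 1)
        if magnitudes[i] > magnitudes[i - 1] and magnitudes[i] >= magnitudes[i + 1]
    ]
    peaks.sort(reverse=True)
    return sorted(f for _, f in peaks[:2])


times = [0.1 * i for i in range(200)]         # delays from 0 to 19.9 ps (assumed)
freqs = [0.30 + 0.01 * i for i in range(91)]  # 0.30 to 1.20 THz in 0.01-THz steps
trace = synthetic_trace(times)
peaks = two_strongest_peaks(freqs, spectrum(times, trace, freqs))

if __name__ == "__main__":
    print("strongest spectral peaks (THz):", [round(f, 2) for f in peaks])
```

Evaluating the sum on a fine frequency grid, rather than only at FFT bin centres, keeps the peak positions from being limited by the 20-ps trace length.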

The appearance of coherent phonons is understood within the
established potential energy model of the transition. The (8 x 2) → (4 x 1)
transformation is typically described in terms of a tristable energy
surface, with an initial minimum at the (8 x 2) configuration. Electronic
excitation tilts the balance towards the (4 x 1) phase (Fig. 3a),
accompanied by displacive excitation of coherent phonons (DECP).
The question now arises as to how these phonon coherences enable
control over the transition.
additional spots in the (8 x 2) phase indicates the pronounced structural
changes during the phase transition. c, Left, time-resolved integrated
intensities of (4 x 1) and (8 x 2) diffraction spots as a function of the
pump-probe delay $\Delta t_{\mathrm{p-el}}$. Right, fluence-dependent spot
intensities recorded at $\Delta t_{\mathrm{p-el}} = 75$ ps. The (4 x 1) and
(8 x 2) intensities have been normalized to corresponding values at
$\Delta t_{\mathrm{p-el}} < 0$. Error bars, experimental uncertainty in
fluence determination.


Generally, Raman-active phonons can modulate the optical absorption
of a surface, affecting the level of electronic excitation achieved
by the second pulse. This can influence the final-state potential energy
surface and the observed threshold (‘absorption control’; see Fig. 3c).
Amplitude modes of the symmetry-broken state are expected to play
a role in this mechanism, given their direct link to the structural
transformation and their susceptibility to strong displacive excitation.
Moreover, the ballistic nature of the transition and the influence of
DECP suggest that kinetic energy contributes to overcoming a
sufficiently lowered but not completely vanishing barrier (‘ballistic control’;
see Fig. 3b). For the vibrational motion along a reaction coordinate,
in-phase excitation with a second pulse maximizes the effect of DECP
and allows barrier-crossing to the (4 x 1) state (1). Anti-phase excitation,
on the other hand, vibrationally de-excites the system, which then has
insufficient kinetic energy and remains in the (8 x 2) state (2). In a
corresponding real-space picture, by weakening or strengthening
different indium-indium bonds and thus shifting the equilibrium atomic
positions, the second pulse either adds further mechanical stress to
the system (1) or removes it (2). Whereas the absorption modulation
described above may apply to all Raman-active modes q, this
ballistic contribution is only feasible for modes along the reaction
coordinate Q.
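
The in-phase versus anti-phase argument can be illustrated with a toy model in Python. The sketch assumes a single undamped harmonic mode that starts at rest, with each pump pulse instantly shifting the equilibrium position by the same step; this is a textbook simplification chosen for illustration and not the model used in this work. The 0.57 THz frequency is the ULEED shear-mode value reported later in the text.

```python
import math


def amplitude_after_second_pulse(delay_ps, period_ps, step=1.0):
    """Oscillation amplitude about the final equilibrium after two equal steps.

    Toy model (an assumption): one undamped harmonic mode, initially at rest;
    each pulse instantly shifts the equilibrium position by `step`.
    """
    omega = 2.0 * math.pi / period_ps
    # After the first pulse the coordinate follows x(t) = step * (1 - cos(omega t)).
    position = step * (1.0 - math.cos(omega * delay_ps))
    velocity = step * omega * math.sin(omega * delay_ps)
    # The second pulse moves the equilibrium to 2 * step; the new amplitude is
    # fixed by the displacement from it and by the velocity at that instant.
    displacement = position - 2.0 * step
    return math.hypot(displacement, velocity / omega)


if __name__ == "__main__":
    period = 1.0 / 0.57  # ps, period of a 0.57-THz mode
    for fraction in (0.0, 0.25, 0.5, 0.75, 1.0):
        amp = amplitude_after_second_pulse(fraction * period, period)
        print(f"delay = {fraction:.2f} T -> amplitude = {amp:.3f} x step")
```

In this model the amplitude equals 2 |cos(ωΔt/2)| times the step: it doubles for delays at whole periods (in-phase) and vanishes at half-period delays (anti-phase), which mirrors the vibrational de-excitation described above.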

To further elucidate the contributions of these mechanisms, we
complement ULEED by optical pump-probe (OPP) spectroscopy
(see Methods for details), which probes absorption modulation by
coherent phonons. We measure pump-induced changes in the optical
reflectivity, which are directly proportional to absorption changes for
a monolayer on a substrate with real refractive index. A comparison of
the switching efficiency (Fig. 3d) with the transient reflectivity (Fig. 3e)
reveals both similarities and stark differences in the observed
frequencies and their relative amplitudes (see also Extended Data Fig. 6 for
further OPP traces).

Both measurements exhibit a frequency component close to
0.82 THz, which we assign to the hexagon rotation mode of the (8 x 2)
structure. Owing to its anti-phase behaviour at $\Delta t_{\mathrm{p-p}} = 0$
(see Fig. 2b), for this mode, a ballistic control mechanism can be ruled
out. We therefore attribute the rotation-mode ULEED signal to absorption
modulation. This assignment is further corroborated by the modulation
amplitudes in both types of measurements, which are linked through
Fig. 3 | Control mechanisms and comparison between ULEED and optical
pump-probe spectroscopy. a, Phase transition model based on reshaping of
the tristable energy surface by a single pump pulse. For simplicity, the second,
energetically degenerate (8 x 2) minimum is not depicted. Note that the
potential deformation is a continuous function of the excitation density.
b, c, Coherent control mechanisms in double-pulse experiments: ballistic
control (b) and absorption control (c). d, Relative switching efficiency
recorded for unequal pump pulses in ULEED (pump fluences of 0.48 mJ cm⁻²
and 0.15 mJ cm⁻²), corresponding spectral density (right) with reference
frequencies (see Fig. 2c) and short-time Fourier transform (top). A, S and Rot
indicate the frequencies of antisymmetric, symmetric and rotation modes,
respectively. e, Delay-dependent reflectivity changes ΔR/R of the surface
measured in optical pump-probe experiments and corresponding spectral
density ($F_{\mathrm{pump}} = 0.15$ mJ cm⁻²). $\Delta t_{\mathrm{p-pr}}$, delay
between optical pump and optical probe pulses.

frequencies of structural modes given in c. FT, Fourier transform. Bottom,
Fourier-filtered contributions of different frequency components. Brown and
pink shaded regions indicate the distinct initial phases. a.u., arbitrary units.


the total absorption of the monolayer. Our measurements predict
a value of about 1% for the absorption, similar to a recent estimate
(see Methods).

A more intricate situation is found for the low-frequency component,
associated with shear phonons: this dominant feature in ULEED
modulates the switching efficiency to a disproportionately higher
degree than expected from the overall transient reflectivity. Moreover,
the shear mode frequencies measured by ULEED (0.57 THz) and OPP
(0.64 THz) differ significantly. This may be a result of OPP probing a
surface-averaged optical response, whereas ULEED is sensitive to the
transition probability in regions close to threshold. However, density
functional theory and Raman spectroscopy in fact predict two
separate shear modes: whereas the symmetric shear mode (expected
at 0.66 THz) is much more prominent in Raman spectra, only the
antisymmetric shear mode (0.55 THz) is considered relevant for the
transition. These distinct properties suggest that each of OPP
and ULEED mainly probes a different one of these modes, namely the
higher-frequency symmetric and the lower-frequency antisymmetric
shear oscillation, respectively.
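
For orientation, the frequencies quoted above translate into oscillation periods through T = 1/f, where 1 THz corresponds to 1 ps. The Python sketch below tabulates the periods and the relative mismatch between the ULEED and OPP shear frequencies; the dictionary labels and function names are illustrative.

```python
def period_ps(frequency_thz):
    """Oscillation period in picoseconds for a frequency in terahertz."""
    if frequency_thz <= 0:
        raise ValueError("frequency must be positive")
    return 1.0 / frequency_thz


def relative_difference(value, reference):
    """Fractional deviation of value from reference."""
    return (value - reference) / reference


# Mode frequencies (THz) as quoted in the text.
MEASURED_THZ = {"shear (ULEED)": 0.57, "shear (OPP)": 0.64, "rotation": 0.82}
PREDICTED_THZ = {"antisymmetric shear": 0.55, "symmetric shear": 0.66}

if __name__ == "__main__":
    for label, f in {**MEASURED_THZ, **PREDICTED_THZ}.items():
        print(f"{label}: {f:.2f} THz -> {period_ps(f):.2f} ps")
    mismatch = relative_difference(MEASURED_THZ["shear (OPP)"],
                                   MEASURED_THZ["shear (ULEED)"])
    print(f"OPP vs ULEED shear frequency: {mismatch:+.1%}")
```

All three measured periods fall within the 1-2 ps window of the oscillations observed in the double-pulse traces.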

From these considerations, we extract two possible scenarios for 
the role of shear motion, linked to the control mechanisms discussed 
above (Fig. 3). First, if the transition is indeed driven by a shear mode 
separate from that seen in reflectivity, we must invoke the ballistic 
mechanism (Fig. 3b) to explain the ULEED data, directly linking this 
mode to the reaction coordinate. Alternatively, to identify the shear 
contributions in ULEED and OPP with the same mode and absorption 
modulation (Fig. 3c), the observed frequency difference requires fur- 
ther explanation. In particular, this would necessitate a greatly softened 
and larger-amplitude shear mode oscillation only in surface regions 
that can be switched by the second pulse, with an unaltered rotation 
frequency (Fig. 3d). 

Both scenarios imply that the shear displacement corresponds
to the primary reaction coordinate, whereas the rotation completing
the transition is of a secondary nature. Accordingly, we
propose a description of the transition in terms of a two-dimensional
potential-energy surface spanned by the rotation and shear
deformations of the (4 x 1) structure (Fig. 4a), with the system initially
residing in the (8 x 2) minimum. In a reasonable assumption, the first
pulse induces a displacive excitation of coherent phonons towards
the (4 x 1) state. The ULEED measurements show that the transition
efficiency for the second pulse becomes a strong function of the
momentary vibrational state, denoted by the colour-coded
area in Fig. 4b. The combined observations (that is, the differences in
frequency and relative amplitudes between ULEED and OPP, as well
as the phases of both modes in the double-pulse traces; Fig. 2b) now
suggest an ‘off-diagonal’ transition state in configuration space with
a strongly reduced shear but a largely unaltered rotation (compared
with the (8 x 2) state). This interpretation is further supported by
the transient softening and hardening of the shear and
rotation mode, respectively, near $\Delta t_{\mathrm{p-p}} = 0$ (Fig. 3d), which
we have consistently observed in a number of experiments (see Extended
Data Fig. 7).

In the proposed pathway of the transition, overcoming an ‘early’
barrier, the In-chains are first ‘un-sheared’ and subsequently
transformed into the zigzag structure by a rotation. Such a pathway
transiently passes the ‘trimer’ state, which has been intensely studied
by density functional theory and is expected to be almost energetically
degenerate to the (4 x 1) state. The existence of two local minima
for a similar rotation displacement, namely the trimer and the (8 x 2)
configuration, also supports the existence of a transition state along
the shear axis.

Future experimental and theoretical studies, involving density 
functional theory and molecular dynamics simulations, may further 
elucidate the potential energy surface, possible additional pathways 
and the sequential nature of the transition. The microscopic excitation 
Fig. 4 | Two-dimensional picture of the phase transition dynamics.
a, Proposed two-dimensional model of the potential energy surface for the
(8 x 2) → (4 x 1) transition in shear/rotation configuration space, exhibiting
a transition state along the shear axis from the (8 x 2) state. b, Sketch of
exemplary system trajectories close to the (8 x 2) state before (top), in
between (middle) and after (bottom) two subsequent displacive excitations
(yellow, $\Delta t_{\mathrm{p-p}} = 0$; red, $\Delta t_{\mathrm{p-p}} \approx T_{\mathrm{rot}}/2$;
violet, $\Delta t_{\mathrm{p-p}} \approx T_{\mathrm{shear}}/2$). $T_{\mathrm{shear}}$ and
$T_{\mathrm{rot}}$, oscillation period of the shear and rotation mode,
respectively. The phase transition efficiency
(colour-coded) is a strong function of the vibrational coordinates at the time of
the second pulse (middle and right panel). Highest efficiency is achieved for a
maximum sheared/minimum rotated structure (see middle panel).


mechanism underlying the phonon coherences deserves further con- 
sideration, including its link to the femtosecond electron transfer and 
hole-induced driving forces recently described. Finally, considering
the surface heterogeneity, the influence of frequency changes
at domain boundaries on the local transition dynamics will be of
interest. 

Our results demonstrate the coherent control of a surface struc- 
tural phase transition by all-optical manipulation of key phonon 
modes, and show that the outcome of the phase transition, much 
like many chemical reactions, depends on the momentary state of 
the coherent vibrational wavepacket. Close to the transition thresh- 
old, both absorption modulation by Raman-active modes and the 
ballistic motion of the order parameter in overcoming the barrier 
should be considered. The latter contribution could be enhanced 
with mode-selectivity by a repeated stimulation of the coherent 
phonon amplitude, which, as in the present system, decays more slowly
than the electronic excitation.

In molecular chemistry, it has long been known that vibrational 
excitation and the location of the transition state may greatly affect 
reaction rates, a principle captured by the Polanyi rules. Our work
extends this principle to surfaces and solids and introduces the vibra- 
tional phase as a decisive parameter to target the transition state. We 
believe that exploiting vibrational coherences in low-dimensional 
and strongly correlated materials, as well as molecular adsorbates, 
holds promise for structural and electronic control in surface physics 
and chemistry, providing a handle to steer physical functionality and 
chemical reactivity. 
Methods


Ultrafast LEED set-up 

We recently developed ULEED in an optical-pump/electron-probe 
scheme for the time-resolved investigation of structural dynamics 
at solid-state surfaces. LEED is a surface-sensitive technique, in
which the diffraction pattern of electrons backscattered from a sample
is analysed to obtain information about the surface structure.

To achieve high temporal and momentum resolution, we use a
laser-driven electron gun consisting of a nanometric tungsten tip as
well as four metal electrodes (outer diameter 2 mm, aperture diameter
400 µm), which act as a suppressor-extractor unit and an electrostatic
einzel lens. Electron pulses are generated by localized two-photon
photoemission by illuminating the tip apex with femtosecond laser
pulses (central wavelength 400 nm, pulse duration 45 fs, pulse energy
20 nJ) at repetition rates up to 100 kHz (note that the data presented
in Figs. 2b and 3d were recorded at a repetition rate of 25 kHz whereas
all other ULEED measurements were carried out at a repetition rate of
100 kHz). The needle cathode provides a reduced electron beam
emittance, allowing for a momentum resolution in diffraction of
0.03 Å⁻¹. Moreover, we lower the dispersion-induced broadening effect
on the electron pulse by decreasing the propagation length between the
electron source and the sample. In this respect, the reduced dimensions
of the electron gun allow for operational distances of a few millimetres
at a reasonably small fraction of shadowed electron diffraction signal,
resulting in electron-pulse durations down to 16 ps (depending on
gun-sample distance). The backscattered electrons from the surface
are amplified and recorded by a combination of a chevron microchannel
plate, a phosphor screen and a cooled sCMOS (scientific complementary
metal-oxide-semiconductor) camera, resulting in typical integration
times of t_int = 20 s per frame in time-resolved measurements.

In ULEED pump-probe experiments (Fig. 1c), the surface structure is
excited by ultrashort light pulses (λ_c = 1,030 nm, ħω = 1.2 eV,
Δτ = 212 fs) from an Yb:YAG amplifier system and probed by electron
pulses (kinetic energy E_kin = 80 eV) at a variable time delay Δt_p-el.
To ensure a homogeneous excitation of the area probed by the electrons,
we expand the optical pump beam to (297 ± 13) µm × (223 ± 14) µm in the
sample plane, which is considerably larger than the focal spot size of
the electron gun (<80 µm × 80 µm). The electron beam diameter
corresponds to at least hundreds of structural correlation lengths
(taken from scanning tunnelling microscopy literature), thus
averaging over a large ensemble of local configurations.

For the coherent control of the structural phase transition between the
(4 x 1) and the (8 x 2) phase (Fig. 2a,b; Fig. 3d), we use two pump
pulses P_1 and P_2 with distinct central wavelengths λ_c
(P_1: λ_c = 1,030 nm, ħω = 1.2 eV, Δτ = 212 fs; P_2: λ_c = 800 nm,
ħω = 1.55 eV, Δτ = 232 fs) from the amplifier system and an optical
parametric amplifier (OPA) to avoid interference effects around
time-zero (coherent artefacts). The P_1 and P_2 beams are aligned
collinearly and subsequently focused onto the sample by a single lens.
To determine the temporal overlap of the pump pulses, we perform
cross-correlation measurements using a fast nonlinear photodiode (GaP)
(see Extended Data Fig. 1b). A sketch of the experimental set-up is
depicted in Extended Data Fig. 1a.


Optical pump-probe set-up 

To investigate the optical absorption modulation caused by struc- 
tural modes of the indium monolayer, we use an optical pump-probe 
set-up to measure the transient reflectivity of the In/Si(111) surface (see 
Extended Data Fig. 5). In this set-up, a pump pulse (λ_c = 1,030 nm,
f_rep = 100 kHz) induces coherent phonon oscillations, and the resulting
reflectivity changes are monitored by a probe pulse (λ_c = 800 nm,
f_rep = 100 kHz) as a function of the time delay Δt_p-pr. The pump
intensity is modulated at a frequency f_mod = 25 kHz by an acousto-optic
modulator synchronized to the laser system. Pump and probe pulses are
collinearly focused on the sample at an incident angle α = 31°. The
reflected beam is guided through two short-pass filters (2 × OD4 for
λ > 900 nm) and focused onto a silicon photodiode. The photodiode and
reference signals are processed in a lock-in amplifier, yielding the
data presented in Fig. 3e
and Extended Data Fig. 6. 


Sample preparation 

All experiments were carried out under ultra-high-vacuum conditions 
(base pressure p < 2 × 10⁻¹⁰ mbar) to minimize surface defects from
adsorption, which were found to have an influence on the formation
of the low-temperature (8 x 2) phase as well as the lifetime of the
metastable state. The samples were prepared by flash-annealing Si(111)
wafers (phosphorus-doped, resistivity R = 0.6-2 Ω cm) at
T_max = 1,250 °C through direct current heating (maximum pressure
during flashing was kept below p_max = 2 × 10⁻⁹ mbar). Evaporation of
1.2 monolayers of indium onto the resulting Si(111)(7 x 7) surface
reconstruction at room temperature followed by annealing at T = 500 °C
for 300 s resulted in a high-quality (4 x 1) phase, as verified in our
ultrafast LEED set-up. After inspection of the (4 x 1) phase, the
samples were immediately cooled to a base temperature of T = 60 K with
an integrated
continuous-flow helium cryostat. The phase transition between the 
high-temperature (4 x 1) and the low-temperature (8 x 2) phase was 
observed at 125 K. LEED images of the (7 x 7), the (4 x 1) and the (8 x 2) 
structures are shown in Extended Data Fig. 2. 


Data analysis 

The LEED pattern of the (8 x 2) phase from Fig. 1a and the cut-outs 
(that is, sections of the pattern) shown in Fig. 1b were recorded at a
base temperature of T = 60 K (cut-out of the (4 x 1) phase: T = 300 K)
with an integration time of t_int = 60 s. The diffraction images are
plotted on a logarithmic colour scale to enhance the visibility of the
twofold streaks, which are typically one order of magnitude weaker
than the (8 x 2) spots. The location of the cut-out regions within
the complete diffraction image is indicated by the white rectangle
in Fig. 1a.

For the analysis of single- and double-pump ULEED experiments, 
we sum up the background-corrected raw data peak intensities within 
circular areas of interest (radius r) around the selected (4 x 1) and (8 x 2) 
spots. To this end, the background is determined within a ring (width 
dr) around the edge of each area of interest. We use radii of
r = 0.10 Å⁻¹ (40 pixels) for the fluence-dependent data presented in
Fig. 2a, r = 0.08 Å⁻¹ (30 pixels) for the data presented in Figs. 2b
and 3d and a ring width of dr = 0.008 Å⁻¹ (three pixels) for all
datasets. The indices
of the analysed spots are listed in Extended Data Fig. 4b. 
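
The spot-integration step described above can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function name `integrate_spot`, the choice to place the background ring just outside the circular area, and the synthetic test image are all assumptions made for the example.

```python
import numpy as np

def integrate_spot(image, center, r_px, dr_px):
    """Background-corrected intensity summed over a circular area of interest.

    image  : 2D array of detector counts
    center : (row, col) of the spot centre in pixels
    r_px   : radius of the circular area of interest in pixels
    dr_px  : width of the ring used to estimate the background
             (assumed here to lie just outside the circle)
    """
    rows, cols = np.indices(image.shape)
    dist = np.hypot(rows - center[0], cols - center[1])
    disk = dist <= r_px
    ring = (dist > r_px) & (dist <= r_px + dr_px)
    background = image[ring].mean()  # mean background level per pixel
    return float((image[disk] - background).sum())

# Synthetic image (made-up numbers): flat background plus a Gaussian spot.
yy, xx = np.indices((201, 201))
spot = 500.0 * np.exp(-((yy - 100) ** 2 + (xx - 100) ** 2) / (2 * 6.0 ** 2))
image = 10.0 + spot
signal = integrate_spot(image, center=(100, 100), r_px=40, dr_px=3)
flat = integrate_spot(np.full((201, 201), 10.0), center=(100, 100), r_px=40, dr_px=3)
```

With a radius of 40 pixels and a ring width of three pixels, the call mirrors the pixel values quoted for Fig. 2a; a flat image integrates to zero, so only the spot intensity above the local background is counted.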

To determine the relative changes in the (4 x 1) and (8 x 2) spot inten- 
sities caused by a single optical pulse (see Fig. 1c), the integrated peak 
intensities for a saturated suppression/enhancement (Δt_p-el = 75 ps)
are normalized to the value before time-zero. This delay was chosen to
account for the finite electron-pulse duration under the conditions of
the experiment (about 50 ps). We consider potential contributions of
cumulative heating effects by recording the intensities of both (4 x 1)
and (8 x 2) diffraction spots as a function of the sample base
temperature T_b (see Extended Data Fig. 3a). For the highest relevant
value of fluence (F ≈ 1.35 mJ cm⁻²), we find a moderate increase of T_b
to a maximum temperature of 82 K, which is well below the transition
temperature T_c.

The fluence-dependent enhancement/suppression of the (4 x 1)/(8 x 2)
signal in the double-pulse experiments (see Fig. 2a) is shown
relative to the intensity I(Δt_p-el = 75 ps, F_1030 = 0, F_800 = 0)
without optical excitation. Concerning Figs. 2b and 3d, we define the
relative switching efficiency as

$$E_s(\Delta t_{\mathrm{p-p}}) = 1 - \frac{I_{8\times 2}(\Delta t_{\mathrm{p-p}}) - \langle I_{8\times 2}(\Delta t_{\mathrm{p-p}} > \Delta t^{\mathrm{th}}_{\mathrm{p-p}})\rangle}{\langle I_{8\times 2}(\Delta t_{\mathrm{p-p}} > \Delta t^{\mathrm{th}}_{\mathrm{p-p}})\rangle} \qquad (1)$$

with Δt^th_p-p = 10 ps (17 ps) in Fig. 2b (3d), respectively. In all
cases, Δt^th_p-p is significantly larger than the temporal overlap of
the two optical pulses given by their cross-correlation, and the
damping constant of the coherent phonon oscillations.
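
As an illustration of this normalization, the sketch below evaluates one minus the relative deviation of the (8 x 2) intensity from its mean at long pulse-to-pulse delays. The function name, the synthetic trace and all numerical values in the demo are assumptions chosen for the example, not measured data.

```python
import numpy as np

def switching_efficiency(delays_ps, intensity_8x2, threshold_delay_ps):
    """Relative switching efficiency 1 - (I - <I>_ref) / <I>_ref.

    <I>_ref is the mean (8 x 2) intensity over all pulse-to-pulse delays
    larger than threshold_delay_ps (10 ps or 17 ps in the text).
    """
    delays_ps = np.asarray(delays_ps, dtype=float)
    intensity_8x2 = np.asarray(intensity_8x2, dtype=float)
    reference = intensity_8x2[delays_ps > threshold_delay_ps].mean()
    return 1.0 - (intensity_8x2 - reference) / reference

# Synthetic trace (made-up numbers): a damped 0.5-THz modulation of the
# (8 x 2) intensity on top of a constant level.
delays = np.linspace(0.0, 20.0, 201)
trace = 0.60 + 0.05 * np.exp(-delays / 3.0) * np.cos(2 * np.pi * 0.5 * delays)
efficiency = switching_efficiency(delays, trace, threshold_delay_ps=10.0)
```

By construction the efficiency averages to one over the reference window, and a remaining (8 x 2) intensity above the long-delay level maps to an efficiency below one.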


Fourier analysis 

We use super-Gaussian windows in the time domain to isolate the
relevant sections in our datasets and reduce numerical artefacts of the
fast Fourier transform:

$$F_{\mathrm{filt},t} = \exp\left[-\left(\frac{(t - t_{\mathrm{shift}})^2}{2\sigma_t^2}\right)^{n}\right] \qquad (2)$$

where the exponent n > 1 sets the order of the super-Gaussian. The
values of σ_t and t_shift used to create the respective figures are as
follows: Fig. 2b, σ_t = 3.2 ps, t_shift = 4.5 ps; Fig. 3d, σ_t = 4.9 ps,
t_shift = 6.5 ps; Fig. 3e, σ_t = 3.5 ps, t_shift = 4 ps. To extract the
contributions of the individual modes to the signal from Fig. 2b, a
super-Gaussian frequency window

$$F_{\mathrm{filt},f} = \exp\left[-\left(\frac{(f - f_c)^2}{2\sigma_f^2}\right)^{n}\right] \qquad (3)$$

is used to filter the relevant frequency range in the fast Fourier
transforms. The data shown in Fig. 2b (bottom) are obtained by an inverse
Fourier transform of the filtered Fourier transform (shear mode:
f_c = 0.5 THz, σ_f = 0.10 THz (frequency range 0.37-0.63 THz); rotation
mode: f_c = 0.9 THz, σ_f = 0.07 THz (frequency range 0.80-0.99 THz);
d.c.: f_c = 0.0 THz, σ_f = 0.14 THz (frequency range 0-0.19 THz)). Here,
f_c and σ_f denote the centre frequency and width of the respective
Fourier window. To study the delay-dependent frequency change of both
the shear and the rotation mode, we perform a short-time Fourier
transform of the dataset depicted in Fig. 3d (bottom), again with a
super-Gaussian window function in the time domain (σ_t = 3.6 ps; see
equation (2)), yielding the data shown in Fig. 3d (top).
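
The windowing and band-filtering procedure can be sketched with NumPy as below. This is an illustrative reconstruction under stated assumptions: the super-Gaussian order (`order=5`) is not taken from the text and is only a plausible choice, the function names are hypothetical, and the two-tone test signal is synthetic.

```python
import numpy as np

def super_gaussian(x, x0, sigma, order):
    """exp(-(((x - x0)^2) / (2 sigma^2))^order); order=1 is an ordinary Gaussian."""
    return np.exp(-(((x - x0) ** 2) / (2.0 * sigma ** 2)) ** order)

def isolate_mode(t_ps, signal, f_c, sigma_f, t_shift, sigma_t, order=5):
    """Window the trace in time, band-filter it in frequency, transform back."""
    windowed = signal * super_gaussian(t_ps, t_shift, sigma_t, order)
    spectrum = np.fft.rfft(windowed)
    freqs_thz = np.fft.rfftfreq(t_ps.size, d=t_ps[1] - t_ps[0])  # THz for ps input
    filtered = spectrum * super_gaussian(freqs_thz, f_c, sigma_f, order)
    return np.fft.irfft(filtered, n=t_ps.size)

# Synthetic two-mode trace (made-up amplitudes) sampled every 0.05 ps.
t = np.linspace(0.0, 20.0, 400, endpoint=False)
trace = np.cos(2 * np.pi * 0.5 * t) + 0.5 * np.cos(2 * np.pi * 0.9 * t)
shear = isolate_mode(t, trace, f_c=0.5, sigma_f=0.10, t_shift=4.5, sigma_t=3.2)

# Dominant frequency of the filtered trace.
freqs = np.fft.rfftfreq(t.size, d=t[1] - t[0])
peak_thz = freqs[np.argmax(np.abs(np.fft.rfft(shear)))]
```

For a window of order 5, the half-maximum half-width is about 1.36 times the width parameter, which is roughly in line with the frequency ranges quoted above; this agreement is suggestive only and does not establish the order actually used.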


Reflectivity and absorption of the indium monolayer 

To relate the reflectivity changes measured by OPP to the absorption
of the atomic indium wires, we follow an established treatment of the
optical properties of an ultrathin layer on top of a dielectric
substrate. In our case, the silicon substrate has an essentially real
(and comparatively large) refractive index (n_s = 3.67 + 0.005i;
λ = 800 nm). For normal incidence, the reflection and transmission
coefficients r_0 and t_0 of the bare substrate are
the standard expressions:


$$r_0 = \frac{1 - n_s}{1 + n_s}, \qquad t_0 = \frac{2}{1 + n_s} \qquad (4)$$


Thus, the reflected wave is phase-shifted by 180°, and the transmitted
wave is not phase-shifted. Furthermore, since the sheet conductivity
σ^s of a monolayer satisfies |Z_0 σ^s| ≪ |n_s − 1| (where Z_0 is the
free-space impedance), the monolayer-induced changes in transmission
and reflection are both proportional to the real part of σ^s, as is the
absorption A of the layer.
In other words, owing to the large real and very small imaginary part of
the substrate refractive index, the imaginary part of the sheet
conductivity leads to only negligible (quadrature) components in the
reflected and transmitted waves. The presence of the monolayer results
in a ratio of reflectance change to layer absorption of
ΔR/A = (n_s − 1)/(n_s + 1) = 0.57. For the mechanism of absorption
modulation (active for the rotation mode), pump-induced variations of
the sheet conductivity δσ^s by coherent phonons will induce variations
in reflectance (δR) and absorption (δA) following the same ratio
δR/δA = ΔR/A. Thus, transient reflectivity (OPP)
directly measures the impact of a specific phonon mode on absorption.
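
The quoted ratio of 0.57 can be checked numerically with the textbook thin-sheet boundary conditions for a conducting layer between vacuum and a substrate at normal incidence. The sketch below is an independent consistency check under that assumption, not the authors' derivation; the sheet conductivity of 1e-5 S is an arbitrary illustrative value chosen to satisfy the small-conductivity condition.

```python
Z0 = 376.730313668  # free-space impedance in ohms

def sheet_on_substrate(n_s, sigma_s):
    """Normal-incidence response of a conducting sheet on a thick substrate.

    Returns reflectance R, transmittance T into the substrate and the
    absorption A = 1 - R - T of the sheet (vacuum on the incident side).
    """
    y = Z0 * sigma_s                    # dimensionless sheet admittance
    r = (1 - n_s - y) / (1 + n_s + y)   # amplitude reflection coefficient
    t = 2 / (1 + n_s + y)               # amplitude transmission coefficient
    R = abs(r) ** 2
    T = n_s.real * abs(t) ** 2
    return R, T, 1 - R - T

n_s = 3.67 + 0.005j                              # silicon at 800 nm (from the text)
R0, T0, A0 = sheet_on_substrate(n_s, 0.0)        # bare substrate
R1, T1, A1 = sheet_on_substrate(n_s, 1e-5)       # real sheet conductivity (assumed value)
R2, T2, A2 = sheet_on_substrate(n_s, 1e-5j)      # purely imaginary sheet conductivity

ratio = (R1 - R0) / A1                           # reflectance change per unit absorption
expected = (n_s.real - 1) / (n_s.real + 1)       # (n_s - 1)/(n_s + 1)
```

In this model the reflectance change caused by a purely imaginary sheet conductivity is far smaller than that caused by a real one of the same magnitude, in line with the statement that the imaginary part contributes only negligible quadrature components.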


Relating ULEED and OPP data 

From the above, variations of layer absorption lead to proportional 
changes in reflectance, with a prefactor that depends on the total 
absorption of the monolayer. This allows us to estimate the monolayer 
absorption, assuming absorption modulation as the sole mechanism 
for the rotation mode. At an identical fluence of the first excitation 
(see Fig. 3d, e), for the rotation mode oscillation, we measure relative 
changes in reflectance (δR/R)_rot = 8 × 10⁻⁵ and modulations of the
ULEED intensity δI of 0.8% (intensity I normalized to its value at
negative times). The steepness of the fluence-dependent intensity (see
Fig. 1c), F_th × (dI/dF)|_F_th = 1.7 at the threshold fluence
F_th = 1 mJ cm⁻², is used to determine the relative changes in
absorption of δA/A = 0.47% via
(δA/A)_rot = (1/F_th) × ((dI/dF)|_F_th)⁻¹ × (δI/I)_rot.
From these values, we obtain an estimate of the
total absorption of the monolayer of A ≈ 1%.
This value is of the same order as a recent estimate (0.5%), again
indicating that absorption modulation is a reasonable explanation for
the rotation-mode contribution to the switching efficiency.
In turn, the observed differences in rotation and shear mode amplitudes
between ULEED and OPP are sufficiently pronounced to imply a
microscopic origin. In addition to possible ballistic contributions, this
includes atomic-scale sample inhomogeneities such as local variations
in barrier height.
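
The arithmetic behind the absorption estimate in this section can be retraced in a few lines. The sketch rests on assumptions that should be read as such: the reflectance modulation is taken as 8 × 10⁻⁵, the bare-substrate reflectance is evaluated at normal incidence from n_s = 3.67 (the experiment uses a 31° angle of incidence), and the variable names are chosen for this example only.

```python
n_s = 3.67                                    # real part of the silicon index at 800 nm
reflectance = ((n_s - 1) / (n_s + 1)) ** 2    # bare substrate, normal incidence (assumption)
ratio_dR_over_A = (n_s - 1) / (n_s + 1)       # reflectance change per unit absorption

rel_dI = 0.008        # ULEED intensity modulation of the rotation mode (0.8%)
steepness = 1.7       # F_th x dI/dF at the threshold fluence
rel_dA = rel_dI / steepness                   # relative absorption modulation

rel_dR = 8e-5         # relative reflectance modulation (exponent assumed to be -5)

# delta_R = ratio * delta_A  and  delta_A = rel_dA * A, so
# A = (rel_dR * R) / (ratio * rel_dA)
absorption_total = rel_dR * reflectance / (ratio_dR_over_A * rel_dA)

print(f"relative absorption modulation: {rel_dA:.2%}")
print(f"estimated monolayer absorption: {absorption_total:.2%}")
```

Under these assumptions the relative absorption modulation reproduces the quoted 0.47%, and the total absorption comes out on the per-cent scale, the same order as the 0.5% estimate cited in the text.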


Data availability 


The data that support the findings of this study are available on request 
from the corresponding author. 
Extended Data Fig. 1 | ULEED set-up. a, Ultrashort laser pulses
(P_1: λ_c = 1,030 nm, Δτ = 212 fs) from an Yb:YAG amplifier (left) pump
a non-collinear OPA (output: λ_c = 400 nm, Δτ = 40 fs) and an OPA
(output: P_2, λ_c = 800 nm, Δτ = 232 fs). The 1,030-nm and 800-nm beams
are independently attenuated and collinearly focused onto the sample by
a single lens (400 mm focal length). The relative on-axis position of
the two foci is controlled by adjusting the divergence of the 1,030-nm
beam. The ultraviolet pulses are focused onto the tungsten needle
emitter inside the electron gun (e⁻-gun) to generate ultrashort electron
pulses. The relative timing between the electron probe and each of the
two optical pump pulses is controlled independently by two separate
optical delay stages. The pump-induced changes in the LEED pattern are
recorded using a microchannel plate assembly. b, Cross-correlation of
the two pump pulses recorded with a nonlinear photodiode to determine
the temporal resolution of the double-pump experiment.
Extended Data Fig. 2 | Diffraction images. a-c, Diffraction images and
lineouts of the clean (7 x 7)-reconstructed Si(111) surface (a), the
(4 x 1) phase (b) and the (8 x 2) phase (c) recorded in our ultrafast
LEED set-up (E_kin = 130 eV). Coloured areas correspond to the unit
cells in reciprocal space, arrows indicate the location of the lineouts
shown below. In the transformation from the (4 x 1) to the (8 x 2)
phase, the unit cell is doubled in both dimensions. The twofold
streaks in the diffraction pattern of the (8 x 2) phase originate from
a weak coupling between the atomic chains. The diffraction patterns of
the indium-reconstructed phases feature contributions from three
domains rotated by 120° with respect to each other, as the hexagonal
structure of the underlying substrate allows for three different
orientations of the atomic indium chains.
Extended Data Fig. 3 | Temperature calibration. a, Temperature-dependent
integrated intensities of (4 x 1) (top) and (8 x 2) (bottom) diffraction
spots across the phase transition (T_c ≈ 125 K). b, Integrated
diffraction spot intensities for Δt_p-el < 0 in Fig. 1c as a function of
incident fluence. c, Temperature calibration: a Debye-Waller model is
fitted to the diffraction spot intensities in a for temperatures in the
range 60 K < T < 100 K. Comparing the suppressions in b and c, we find a
maximum temperature increase ΔT ≈ 22 K for the highest fluence value
(F_max ≈ 1.35 mJ cm⁻²) within our measurement range. Note that the
resulting base temperature T_b = 82 K is well below T_c.
Extended Data Fig. 4 | Definition of basis vectors and diffraction spot
indexing. a, Schematic LEED pattern of the (8 x 2) phase and basis
vectors (red) of the reciprocal lattice used to index the diffraction
spots. b, Complete list of diffraction spots used in analysis.

Row labels in b: pump-probe measurements (Fig. 1) and static heating
(Extended Data Fig. 3); double-pulse measurements (Fig. 2a);
double-pulse measurements (Fig. 2b and Fig. 3d).

(4 x 1): [(1 3), (3 1), (4 0), (4 -1), (4 -4), (2 -4), (0 -4), 1-3),
(2 -2), (-3 -1), (-1 (-4 2), (-3 4), (-2 4), (-2 2), (-3 0), (-2 0),
(0 -2), (0 -1), (2 -2), (3 -3)]

(8 x 2): [(3 5), (5 3), (8 -3), (8 -5), (3 -8), (5 -8), (3 -5), (-5 -3),
(-8 3), (-8 5), (-5 8), (-3 8)]
Extended Data Fig. 5 | Optical pump-probe set-up. a, Ultrashort laser
pulses (P_1: λ_c = 1,030 nm, Δτ = 212 fs, ‘Pump’) from an Yb:YAG
amplifier (left) pump an OPA (output: P_2, λ_c = 800 nm, Δτ = 232 fs,
‘Probe’). The intensity of the pump beam is modulated at a frequency of
25 kHz by an acousto-optic modulator (AOM). Pump and probe beams are
independently attenuated and collinearly focused onto the sample by a
single lens (200-mm focal length). The relative on-axis position of the
two foci can be adjusted using a telescope assembly. The reflected
beams pass two short-pass filters (SP) blocking the pump pulses and are
focused on a silicon photodiode (PD). The relative timing between pump
and probe pulses is controlled by an optical delay stage. The
pump-induced reflectivity changes of the sample are measured by
processing the PD and reference signals in a lock-in amplifier. RF,
radio-frequency; ND, neutral density.
Extended Data Fig. 6 | Ultrafast absorption modulation. a, Reflectivity
change ΔR/R of the In/Si(111) surface as a function of the time delay
Δt_p-pr between pump (1,030 nm) and probe pulses (800 nm;
F = 0.14 mJ cm⁻²). Offsets are added to the datasets for clarity.
b, Fourier spectra of ΔR/R(Δt_p-pr) for F = 0.04-1.22 mJ cm⁻²,
revealing two main coherent contributions (0.65 THz and 0.84 THz for
F = 0.04 mJ cm⁻²) to the signals in a, attributed to the symmetric
shear and rotation modes. An additional but minor lower-frequency
contribution to the reflectivity cannot be excluded at this point,
given the frequency resolution of the experiment. c, Transient
(Δt_p-pr ≈ 0.25 ps) and long-lived (Δt_p-pr ≈ 9 ps) contributions to
ΔR/R as a function of pump fluence. The data are normalized to
ΔR/R(Δt_p-pr < 0) and the respective values for F = 2.30 mJ cm⁻².
d, Fluence-dependent frequency shifts of the two modes. The rotation
mode softens significantly for higher fluences (error bars, 95% CI of
the fit). e, Normalized Fourier amplitudes of shear and rotation
modes as a function of fluence.
F_800 = 0.21 mJ cm⁻²), revealing a pronounced softening/hardening of the
Technologies such as batteries, biomaterials and heterogeneous catalysts have
functions that are defined by mixtures of molecular and mesoscale components.
As yet, this multi-length-scale complexity cannot be fully captured by atomistic
simulations, and the design of such materials from first principles is still rare.
Likewise, experimental complexity scales exponentially with the number of variables,
restricting most searches to narrow areas of materials space. Robots can assist in
experimental searches but their widespread adoption in materials research is
challenging because of the diversity of sample types, operations, instruments and
measurements required. Here we use a mobile robot to search for improved
photocatalysts for hydrogen production from water. The robot operated
autonomously over eight days, performing 688 experiments within a ten-variable
experimental space, driven by a batched Bayesian search algorithm. This
autonomous search identified photocatalyst mixtures that were six times more active
than the initial formulations, selecting beneficial components and deselecting
negative ones. Our strategy uses a dexterous free-roaming robot, automating
the researcher rather than the instruments. This modular approach could be
deployed in conventional laboratories for a range of research problems beyond
photocatalysis.


The mobile robot platform is shown in Fig. 1a and Extended Data Fig. 1. It 
can move freely in the laboratory and locates its position using a
combination of laser scanning coupled with touch feedback for fine
positioning (Methods and Supplementary Video 1). This gave an (x, y)
positioning precision of 0.12 mm and an orientation precision of
θ ± 0.005° within a standard laboratory environment with dimensions
7.3 m × 11 m (Fig. 1b; Extended Data Fig. 2; Supplementary Figs. 1-10).
This precision allows the robot to carry out dexterous manipulations at
the various stations in the laboratory (Fig. 1; Extended Data Fig. 3)
that are comparable to those performed by human researchers, such as
handling sample vials and operating instruments. The robot has
human-like dimensions and reach (Fig. 1a, d) and it can therefore
operate in a conventional, unmodified laboratory. Unlike many automated
systems that can dispense only
liquids, this robot dispenses both insoluble solids and liquid solutions 
with high accuracy and repeatability (Supplementary Figs. 12, 13, 16-20), 
broadening its utility in materials research. Factoring in the time needed
to recharge the battery, this robot can operate for up to 21.6 h per day 
with optimal scheduling. The robot uses laser scanning and touch feed- 
back, rather than a vision system. It can therefore operate in complete 
darkness, if needed (Supplementary Video 2), which is an advantage 
when carrying out light-sensitive photochemical reactions, as here. 
The robot arm and the mobile base comply with safety standards for 
collaborative robots, allowing human researchers to work within the 
same physical space (Supplementary Information section 1.5). A video 
of the robot operating an autonomous experiment over a 48-h period 
is shown in Supplementary Video 1.

The benefits of combining automated experimentation with a
layer of artificial intelligence (AI) have been demonstrated for flow
reactors, photovoltaic films, organic synthesis, perovskites
and in formulation problems. However, so far no approaches have
integrated mobile robotics with AI for chemical experiments. Here,
we built Bayesian optimization into a mobile robotic workflow to
conduct photocatalysis experiments within a ten-dimensional space.
Semiconductor photocatalysts that promote overall water splitting
to produce both hydrogen and oxygen are still quite rare. For many
catalysts, a sacrificial hole scavenger, such as triethylamine (TEA)
or triethanolamine (TEOA), is needed to produce hydrogen from water,
and these amines are irreversibly decomposed in the reaction. It has
proved difficult to find alternative hole scavengers that compete with
these organic amines.

Our objective was to identify bioderived hole scavengers with effi- 
ciencies that match petrochemical amines and that are not irreversibly 
decomposed, with the long-term aim of developing reversible redox 
shuttles. The photocatalyst that we chose was P10, a conjugated
polymer that shows good hydrogen evolution rates (HERs) in the
presence of TEOA. We first used the
robot to screen 30 candidate hole scavengers (Extended Data Fig. 4). 
This was done using a screening approach, without any Al. Initially, the 
robot loads a solid-dispensing station that weighs any solid compo- 
nents into sample vials (Fig. 1c), in this case the catalyst, P10. Next, the 
vials are transported 16 at a time in a rack to a dual liquid-dispensing 
station (Extended Data Fig. 3c), where the liquid components are added; 
here, 50 g l⁻¹ aqueous solutions of the candidate hole scavengers
(Supplementary Videos 3, 4). The robot then places the vials into a capping
station, which caps the vials under nitrogen (Supplementary Fig. 21; 
Supplementary Video 5). Optionally, the capped vials are then placed 
into a sonication station (Supplementary Fig. 23; Supplementary 
Fig. 1 | Autonomous mobile robot and experimental stations. a, Photograph
showing robot loading samples into the photolysis station. b, Map of the
laboratory generated by laser scanning showing positions of the eight stations;
the orange crosshairs indicate recorded navigation locations and the robot
position is indicated by the green rectangle. Inputs 1-3 are areas for the storage
of empty vials or completed sample racks. GC, gas chromatography station.
c, Robot loading empty sample vials into the solid-dispensing station before
dispensing the photocatalyst. d, Loading the gas chromatography station with
a new rack of samples for analysis. e, Storing racks of completed samples in
Input Station 1 after gas chromatography analysis.


Video 3) to disperse the solid catalyst in the aqueous phase. The vials 
are then transported to a photolysis station, where they are illuminated
with a mixture of ultraviolet and visible light (Fig. 1a; Extended Data
Fig. 3b; Supplementary Fig. 24; Supplementary Video 6). After
photolysis, the robot transfers the vials to a headspace gas
chromatography station where the gas phase is analysed for hydrogen
(Fig. 1d) before storage of completed samples (Fig. 1e). Except for the
capping station
and the photolysis station, which were built specifically for this work- 
flow, the other stations used commercial instruments with no physical 
hardware modifications: the robot operates them in essentially the 
same way that a human researcher would. 

Conditional automation was used in this hole scavenger screen to 
repeat any hits; that is, samples that showed a hydrogen evolution rate 
(HER) of >200 µmol g⁻¹ h⁻¹ were automatically re-analysed five times.
Most of the 30 scavengers produced little or no hydrogen (Extended Data
Fig. 4), except for L-ascorbic acid (256 ± 24 µmol g⁻¹ h⁻¹) and
L-cysteine (1,201 ± 88 µmol g⁻¹ h⁻¹). Analysis by ¹H nuclear magnetic
resonance (NMR) spectroscopy showed that L-cysteine was cleanly
converted to L-cystine (Supplementary Fig. 32), indicating that it may
have potential as a reversible redox shuttle in an overall water
splitting scheme.
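
The conditional-automation rule described here (re-analyse any hit five times) can be sketched in a few lines. This is a hypothetical illustration, not the robot's control software: `rescreen_hits`, the `measure` callback, the stub measurement and the entry labelled "scavenger X" are all invented for the example, and only the threshold and repeat count come from the text.

```python
from statistics import mean, stdev

HIT_THRESHOLD = 200.0  # hydrogen evolution rate in umol g^-1 h^-1 (from the text)
N_REPEATS = 5          # number of automatic re-analyses (from the text)

def rescreen_hits(first_pass, measure, threshold=HIT_THRESHOLD, repeats=N_REPEATS):
    """Re-measure every sample whose first-pass HER exceeds the threshold.

    first_pass : dict mapping sample name -> first-pass HER
    measure    : callable(name) -> HER of one repeat measurement (hypothetical)
    Returns a dict mapping each hit to (mean, standard deviation) of its repeats.
    """
    summary = {}
    for name, her in first_pass.items():
        if her > threshold:
            values = [measure(name) for _ in range(repeats)]
            summary[name] = (mean(values), stdev(values))
    return summary

# First-pass values: two taken from the text, one made-up non-hit.
first_pass = {"L-ascorbic acid": 256.0, "L-cysteine": 1201.0, "scavenger X": 12.0}

def stub_measure(name):
    """Stand-in for a real measurement; simply returns the first-pass value."""
    return first_pass[name]

summary = rescreen_hits(first_pass, stub_measure)
```

Only samples above the threshold trigger repeat measurements, so the extra instrument time is spent on candidates that already look promising.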

While it showed promise as a hole scavenger, L-cysteine produced 
much less hydrogen than an aqueous solution of TEOA at the same 
gravimetric concentration (2,985 ± 103 µmol g⁻¹ h⁻¹ at 50 g l⁻¹). We
therefore sought to increase the HER of the P10/L-cysteine system by 
using an autonomous robotic search based on five hypotheses (Fig. 2a). 
Fig. 2 | Hypothesis-led autonomous search strategy. a, The robot searches
chemical space to optimize the activity of the photocatalyst + scavenger
combination according to five separate hypotheses. It does this by
simultaneously varying the concentration of the ten chemical species shown
here. b, Plot showing the size of the simplex, or the search space, created with a
discretization of 19 concentrations for each liquid and 21 concentration levels
for the solid catalyst, P10, which corresponds to the solid/liquid dispensing
precision over the constrained space of the experiment. For this
ten-component problem, the full simplex has 98,423,325 points.


The first hypothesis was that dye sensitization might improve light
absorption and hence the HER, as found for the structurally related
covalent organic framework, FS-COF. Here, three dyes were investigated
(Rhodamine B, Acid Red 87 and Methylene Blue). Second, we
hypothesized that pH might influence the catalytic activity (NaOH
addition). The third hypothesis was that ionic strength could also be
important (NaCl addition). Catalyst wettability is known to be a factor
in photocatalytic hydrogen evolution using conjugated polymers,
so the addition of surfactants (sodium dodecyl sulphate, SDS, and
Fig. 3 | Output from the autonomous robotic search. a, Plot showing
hydrogen evolution achieved per experiment in an autonomous search that
extended over 8 days. Sixteen experiments were performed per batch, along
with two baseline controls. The baseline hydrogen evolution was
3.36 ± 0.30 µmol h⁻¹ (black squares). The maximum rate attained after 688
experiments was 21.05 µmol h⁻¹. The robot made 319 moves between stations


polyvinylpyrrolidone, PVP) formed our fourth hypothesis. Fifth, we 
speculated that sodium disilicate might act as a hydrogen-bonding 
anchor for the scavenger, L-cysteine, or for the dyes, based on the 
observation that it aids in the adsorption of dyes onto the surface of
carbon nitride.

These five hypotheses had the potential to be synergistic or
anti-synergistic; for example, ionic strength could either enhance or
decrease dye adsorption onto the surface of the photocatalyst. We
therefore chose to explore all five hypotheses at once. This involved
the simultaneous variation of the concentration of P10, L-cysteine,
the three dyes, NaOH, NaCl, the two surfactants, and sodium disilicate,
which equates to a ten-variable search space (Fig. 2a). The space
was constrained by the need to keep a constant liquid volume (5 ml),
and therefore a constant head space for gas chromatography analysis,
and by the minimum resolution of the liquid dispensing module (0.25 ml)
and the solid dispensing module (0.2 mg).
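
As a rough illustration of why such a constrained, discretized space grows so quickly, the sketch below counts the ways to split a fixed volume into whole dispensing steps. It is a simplified model, not the authors' calculation: the helper names `count_compositions` and `brute_force_count` are hypothetical, every component is treated as a liquid dispensed in 0.25 ml steps with water making up the remainder, and no per-component limits are applied. It therefore does not attempt to reproduce the figure of 98 million points.

```python
from itertools import product
from math import comb


def count_compositions(n_components: int, n_steps: int) -> int:
    """Count ways to give each component a whole number of dispensing steps
    such that the total does not exceed n_steps (the rest is water).

    Stars-and-bars result: C(n_steps + n_components, n_components).
    """
    return comb(n_steps + n_components, n_components)


def brute_force_count(n_components: int, n_steps: int) -> int:
    """Same count by explicit enumeration; only practical for small inputs."""
    return sum(
        1
        for combo in product(range(n_steps + 1), repeat=n_components)
        if sum(combo) <= n_steps
    )


# 5 ml total volume at 0.25 ml resolution gives 20 dispensing steps.
N_STEPS = round(5.0 / 0.25)

if __name__ == "__main__":
    for k in (2, 4, 6, 8):
        print(f"{k} components: {count_compositions(k, N_STEPS)} grid points")
```

The count rises steeply with each added component, which is why exhaustive screening quickly becomes impractical.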

Problems of this type are defined by a simplex that scales exponentially
with size (Fig. 2b). For this specific search space, there were more
than 98 million points. Full exploration of such a space is unfeasible,
so we developed an algorithm that performs Bayesian optimization
based on Gaussian process regression and a parallel search strategy
(see Methods). To generate a new batch, we build a surrogate model
predicting the HER of potential formulations based on the measurements
performed so far and quantify the uncertainty of prediction.
Subsequent sampling points are chosen using a capitalist acquisition
strategy, where a portfolio of upper confidence bound functions is
generated on an exponential distribution of greed to create markets
of varying risk aversion, which are searched for global maxima. Each
market is given an agent that searches to return a global maximum,


[Fig. 3b radar plot. Legend: 16, 64, 80, 112, 142, 188, 488 and 663 samples. Axis labels include NaOH, Na2Si2O5, L-cysteine and Rhodamine B.]

and travelled a total distance of 2.17 km during this 8-day experiment. b, Radar
plot showing the evolution of the average sampling of the search space in
millilitres; the scale denotes the fraction of maximum solution volume
dispensed. The starting conditions (Batch 1) were chosen randomly. The best
catalyst formulation found after 43 batches contained P10 (5 mg), NaOH (6 mg),
L-cysteine (200 mg) and Na2Si2O5 (7.5 mg) in water (5 ml).


or batch of k-best maxima. The uneven distribution of greed allows 
some suggested points to be highly exploitative, some to be highly 
explorative, and most to be balanced, thus making the strongest use 
of the parallel batch experiments. 

The output from this autonomous robotic search is shown in Fig. 3a.
The baseline HER for P10 and L-cysteine only (5 mg P10 in 5 ml of 20 g l⁻¹
L-cysteine) was 3.36 ± 0.30 µmol h⁻¹. Given that the robot would operate
autonomously over multiple days, this two-component mixture was
repeated throughout the search (two samples per batch) to check for
long-term experimental stability (black squares in Fig. 3a). Initially,
the robot started with random conditions and discovered multicomponent
catalyst formulations that were mostly less active than P10
and L-cysteine alone (the first 22 experiments in Fig. 3a). The robot
then discovered that adding NaCl provides a small improvement to
the HER, validating the hypothesis that ionic strength is important.
In the same period, the robot found that maximizing both P10 and
L-cysteine increased the HER. In further experiments (15-100), the
robot discovered that none of the three dyes or the two surfactants
improves the HER; indeed, they are all detrimental, counter to our
first and fourth hypotheses. These five components were therefore
deselected after around 150 experiments (Fig. 4); that is, after about
2 days in real experimental time (Fig. 3a). Here, P10 differs from the
structurally related crystalline fused-sulfone covalent organic framework
(FS-COF), where the addition of Acid Red 87 increased the HER.
After 30 experiments, the robot learned that adding sodium disilicate
improves the HER substantially in the absence of dyes (up to 15 µmol
after 300 experiments), while deprioritizing the addition of NaOH and
NaCl. After 688 experiments, which amounted to 8 days of autonomous
searching, the robot found that the optimum catalyst formulation is a
mixture of NaOH, L-cysteine, sodium disilicate and P10, giving a HER of
21.05 µmol h⁻¹, which was six times higher than the starting conditions.

A number of scientific conclusions can be drawn from these data.
Increased ionic strength is beneficial for hydrogen production (NaCl
addition), but not as beneficial as increasing the pH (NaOH/sodium
disilicate addition), which also increases the ionic strength. We had
not investigated surfactant addition before, but for the two surfactants
studied here, at least, the effect on catalytic activity is purely negative.
Intriguingly, the dye sensitization that we observed for a structurally
similar covalent organic framework, FS-COF, does not translate to this
polymer, P10, possibly because the COF is porous whereas P10 is not.

To explore the dependence of the algorithmic search performance
on the random starting conditions, we carried out 100 in silico virtual
searches, each with a different random starting point, using a regression
model and random noise to return virtual results (Supplementary
Information section 7). Around 160 virtual experiments were needed, on
average, to find solutions with 95% of the global maximum HER (Fig. 5).

We estimate that it would have taken a human researcher several
months to explore these five hypotheses in the same level of detail
using standard, manual approaches (Supplementary Fig. 31). Manual
hydrogen evolution measurements require about 0.5 days of
researcher time per experiment (1,000 experiments take 500 days).
The semi-automated robotic methods that we developed recently can
perform 100 experiments per day (a half-day to set up, plus a half-day
for automatic dispensing and measurement; 1,000 experiments take
10 days, of which 5 days are dedicated researcher time). The autonomous
robot that we present here also requires half a day to set it up
initially, but it then runs unattended over multiple days (1,000 experiments
take 0.5 days of researcher time). Hence, the autonomous workflow
is 1,000 times faster than manual methods, and at least ten times
faster than semi-automated but non-autonomous robotic workflows.
It is unlikely that a human researcher would have persevered with this
multivariate experiment using manual approaches given that it might
have taken 50 experiments or 25 days to locate even a modest enhancement
in the HER (Fig. 3a). The platform allows us to tackle search spaces
of a size that would otherwise be impossible, which is an advantage for
problems where our current level of understanding does not allow us
to reduce the number of candidate components to a more manageable
number. There were ten components in the example given here, but
search spaces with up to at least 20 components should be tractable
with some modifications to the algorithm.
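
The speed-up factors quoted above follow directly from the researcher-time figures in the same paragraph. The short calculation below simply restates that arithmetic; the dictionary keys and the `speedup` helper are illustrative names, and "faster" is measured in researcher time rather than elapsed time.

```python
N_EXPERIMENTS = 1_000

# Researcher days needed for 1,000 experiments, using the figures quoted in the text.
researcher_days = {
    "manual": 0.5 * N_EXPERIMENTS,                    # 0.5 days per experiment
    "semi_automated": (N_EXPERIMENTS / 100) * 0.5,    # 10 run days, half a day of set-up each
    "autonomous": 0.5,                                # one initial half-day set-up
}


def speedup(slow: str, fast: str) -> float:
    """Ratio of researcher time between two workflows."""
    return researcher_days[slow] / researcher_days[fast]


if __name__ == "__main__":
    print(speedup("manual", "autonomous"))          # 1000.0
    print(speedup("semi_automated", "autonomous"))  # 10.0
```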

It took an initial investment of time to build this workflow (approximately
2 years), but once operating with a low error rate (Supplementary
Fig. 38), it can be used as a routine tool. The time required to
implement this approach in another laboratory would be much shorter,
since much of the 2-year development timescale involved core protocols
and software that are transferable to other research problems. Also,
this modular approach to laboratory automation uses instruments in
a physically unmodified form, so that it will be straightforward to add
further modules, such as for NMR or X-ray diffraction, now that the basic
principles are in place. This modularity makes our strategy applicable
to a wide range of research problems beyond chemistry. The speed and
efficiency of the method allow the exploration of large multivariate
spaces, and the autonomous robot has no confirmation bias; this
raises the prospect of emergent function in complex, multi-component
materials that we could not design in the conventional way. Autonomous
mobile robots could also have extra advantages in experiments
with especially hazardous materials, or where traceability and auditing
are important, such as in pharmaceutical processes.

Fig. 5 | Virtual in silico experiments. Histogram showing the number of
virtual experiments needed to reach 95% of the optimal HER, as determined by
carrying out 100 in silico searches, each with a different random starting point.

This approach also has some limitations. For example, the Bayesian
optimization is blind, in that all components have equal initial importance.
This robotic search does not capture existing chemical knowledge,
nor include theory or physical models: there is no computational
brain. Also, this autonomous system does not at present generate and
test scientific hypotheses by itself. In the future, we propose to fuse
theory and physical models with autonomous searches: for example,
computed structures and properties could be used to bias searches
towards components that have a higher likelihood of yielding the
desired property. This will be important for search spaces with even
larger numbers of components where purely combinatorial approaches
may become inefficient. To give one example, energy-structure-function
maps could be computed for candidate crystalline components
to provide Boltzmann energy weightings for calculated properties,
such as charge transport or optical gap, to bias the robotic search.

Methods


Robot specifications 

The robot used was a KUKA Mobile Robot mounted on a KUKA Mobile
Platform base (Fig. 1a; Extended Data Fig. 1). The robot arm has a
maximum payload of 14 kg and a reach of 820 mm. The KUKA Mobile
Platform base can carry payloads of up to 200 kg. The robot arm and
the mobile base have a combined mass of approximately 430 kg. The
movement velocity of the robot was restricted to 0.5 m s⁻¹ for safety
reasons (section 1.5 of the Supplementary Information). A multipurpose
gripper was designed to grasp 10-ml gas chromatograph sample vials,
solid dispensing cartridges, and a 16-position sample rack (Extended
Data Fig. 5), thus allowing a single robot to carry out all of the tasks
required for this workflow. This robot was specified to be a flexible
platform for a wide range of research tasks beyond those exemplified
here; for example, the 14 kg payload capacity for the arm is not fully
used in these experiments (one rack of filled vials has a mass of 580 g),
but it could allow for manipulations such as opening and closing the
doors of certain equipment. Likewise, the height and reach of the robot
allow for operations such as direct loading of samples into the gas
chromatograph instrument (Fig. 1d). By contrast, a smaller and perhaps
less expensive robot platform might require an additional, dedicated
robot arm to accomplish this, or inconvenient modifications to the
laboratory, such as lowering bench heights.


Robot navigation 

In a process analogous to simultaneous localization and mapping
(SLAM), the robot tracks a cloud of possible positions, and updates
its position to the best fit between the output of its laser scanners and
the map for each position in the cloud. The position of the robot is
determined by x and y (its position on the map) and θ (its orientation
angle). Histograms of the robot position measured over 563 movements
are shown in Supplementary Figs. 2-5, which show that the
(x, y) positioning precision was better than ±10 mm and the orientation
precision was better than ±2.5°, as achieved within a real, working
laboratory environment. This level of precision allows navigation to the
various experimental stations in the laboratory, but it does not allow
fine manipulations, such as placement of sample vials. The precision
was therefore enhanced by using a touch-sensitive 6-point calibration
method. Here, the robot touches six points on a cube that is associated
with each experimental station to find the position and orientation of
the cube relative to the robot (Supplementary Figs. 7-11). This increased
the positioning precision to ±0.12 mm and the orientation precision to
±0.005°. This makes it possible for the robot to operate instruments
and to carry out delicate manipulations such as vial placements at a
level of precision that is broadly comparable to a human operator.


Experimental stations 

The workflow comprised six steps, each with its own station. Solid
dispensing was carried out with a Quantos QS30 instrument (Mettler
Toledo) (Fig. 1c; Supplementary Fig. 11; Extended Data Fig. 3a;
Supplementary Video 3). Liquid dispensing was carried out with a bespoke
system that used a 200 series Mini Peristaltic Pump (Williamson) and
a PCG 2500-1 scale (Kern), to dispense liquids gravimetrically using a
feedback loop (section 2.2 in Supplementary Information; Supplementary
Video 5). This system showed excellent precision and accuracy for
a range of aqueous and non-aqueous liquids over 20,000 dispenses
(Supplementary Figs. 16, 17, 19, 20). A bespoke instrument was built
(Labman) to allow both for sample inertization (to exclude oxygen)
and cap crimping in one step. It would be straightforward to modify
this platform to allow other gases to be introduced; for example, to
study photocatalytic CO2 reduction. The instrument used caps from a
vibratory bowl feeder to cap-crimp 10-ml headspace vials (section 2.2
in Supplementary Information; Supplementary Video 5). If required, a
sonication station was used to disperse the solid photocatalyst in the
aqueous solution, before reaction (Supplementary Fig. 23). Photolysis
was carried out at a bespoke photolysis station (Fig. 1a) that uses vibration
to agitate liquids and a light source that is composed of BL368
tubes and LED panels (Extended Data Fig. 5b; Supplementary Fig. 24;
Supplementary Video 6). Gas chromatograph measurements were performed
with a 7890B GC and a 7697A Headspace Sampler from Agilent
GC (Supplementary Video 3; Extended Data Fig. 3d). The experimental
stations were controlled by a process management system module,
which contains all of the process logic for controlling the labware.
Communication between the process management system and the
stations was achieved using various communication protocols (TCP/IP
over WIFI/LAN; RS-232), as detailed in section 2.7 in the Supplementary
Information (Supplementary Fig. 28).


Autonomous search procedure and scheduling 

The robot worked with batches of 16 samples per sample rack and ran
43 batches (688 experiments) during the search. Of these 688 experiments,
11 results were discarded because of workflow errors or because
the system flagged that the oxygen level was too high (faulty vial seal).
It took, on average, 183 min to prepare and photolyse each batch of
samples and then 232 min per batch to complete the gas chromatograph
analysis. The detailed timescales for each of the steps in the workflow
are shown in Extended Data Fig. 6. The work was heuristically scheduled
in parallel, with the robot starting the oldest available scheduled
job. While the robot was working on one job, other instruments, such
as the solid dispenser, the photolysis station and the gas chromatograph,
worked in parallel. This system can process up to six batches
at once, but given the timescales for this specific workflow, where the
preparation/reaction time is approximately equal to the analysis time,
the robot processed two batches simultaneously. That is, it prepared
samples and ran photolysis for one batch while analysing the hydrogen
produced for the second batch using the gas chromatograph. The robot
recharged its battery automatically in between two jobs when the battery
charge reached a 25% threshold. The robot was charged but idle
for approximately 32% of the time in this experiment, largely because
of time spent waiting for the gas chromatograph analysis, which is the
slow step. In principle, this time could be used to run other experiments
in parallel. The autonomous workflow was programmed to alert the
operator automatically when the system was out of stock (if, for example,
it ran out of sample vials or stock solutions were low), or if a part of the
workflow failed (section 8 of the Supplementary Information). Most
errors could be reset remotely without being in the laboratory because
all stations were equipped with 24/7 closed-circuit television cameras
(Supplementary Fig. 39).
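
To see why overlapping preparation with analysis matters, the sketch below compares a strictly serial schedule with an idealized two-stage pipeline using the average timings quoted above. It is a simplified model under stated assumptions, not the authors' scheduler: the function names are hypothetical, and battery charging, errors, discarded samples and robot travel time are all ignored.

```python
PREP_MIN = 183    # average time to prepare and photolyse one batch (from the text)
GC_MIN = 232      # average gas chromatograph analysis time per batch (from the text)
N_BATCHES = 43    # batches run during the search (from the text)


def serial_minutes(n_batches: int) -> int:
    """Each batch is prepared and analysed before the next one starts."""
    return n_batches * (PREP_MIN + GC_MIN)


def pipelined_minutes(n_batches: int) -> int:
    """Preparation of the next batch overlaps with analysis of the previous one."""
    prep_done = 0   # time at which the current batch leaves preparation
    gc_free = 0     # time at which the gas chromatograph becomes available
    for _ in range(n_batches):
        prep_done += PREP_MIN
        gc_start = max(prep_done, gc_free)  # wait for whichever finishes later
        gc_free = gc_start + GC_MIN
    return gc_free


if __name__ == "__main__":
    for label, minutes in (
        ("serial", serial_minutes(N_BATCHES)),
        ("pipelined", pipelined_minutes(N_BATCHES)),
    ):
        print(f"{label}: {minutes} min = {minutes / (60 * 24):.2f} days")
```

Under these assumptions the pipelined schedule is limited by the gas chromatograph, the slow step, and finishes in roughly 7 days rather than about 12, which is of the same order as the 8 days reported for the real search.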


Bayesian search algorithm 

The AI guidance for the autonomous mobile robot was a batched, constrained,
discrete Bayesian optimization algorithm. Traditionally,
Bayesian optimization is a serial algorithm tasked with finding the
global maximum of an unknown objective function. Here, this equates
to finding the optimal set of concentrations in a multicomponent mixture
for photocatalytic hydrogen generation. The algorithm builds a
model that can be updated and queried for the most promising points
to inform subsequent experiments. This surrogate model is constructed
by first choosing a functional prior $P_{\text{prior}}(\theta)$, informed by existing
chemical knowledge (if any). Given data $D$ and a likelihood model
$P_{\text{likelihood}}(D \mid \theta)$, this yields a posterior distribution of models using
Bayes' theorem:

$$P_{\text{posterior}}(\theta \mid D) = \frac{P_{\text{likelihood}}(D \mid \theta)\, P_{\text{prior}}(\theta)}{P(D)} \qquad (1)$$


The Gaussian process prior used a Matern similarity kernel, constant
scaling and homoscedastic noise. This composite kernel allows for variable
smoothness, catalytic activity and experimental noise. The form
and respective hyperparameters were refined using cross-validation
on other, historical photocatalysis datasets (350 experiments). Other
alternatives for a functional prior included Bayesian neural networks,
but Gaussian processes were selected here for robustness and flexibility.
An acquisition function, $\alpha_{\mathrm{UCB}}$, was assembled from the posterior
distribution by considering the posterior mean, $\mu(x)$, and uncertainty,
$\sigma(x)$. The maximum of this function was then used as the next suggested
experiment. To balance exploitation (prioritizing areas where the mean
is expected to be largest) and exploration (prioritizing areas where the
model is most uncertain), we used an upper confidence bound that
is dependent on a single hyperparameter, $\beta$, to govern how 'greedy'
(exploitative) the search is:

$$\alpha_{\mathrm{UCB}}(x; D) := \mu(x) + \beta\,\sigma(x) \qquad (2)$$
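
A minimal numerical sketch of equation (2) is given below, assuming a zero-mean Gaussian process with a Matern (ν = 5/2) kernel, a constant scale and a fixed homoscedastic noise term. It is written in plain NumPy for self-containment and is not the authors' implementation. The toy data, the kernel hyperparameters, the helper names (`matern52`, `gp_posterior`, `ucb`) and the choice of an exponential distribution with unit scale for the β values are all assumptions made for illustration.

```python
import numpy as np

VARIANCE = 1.0       # constant scaling of the kernel (assumed value)
LENGTH_SCALE = 0.3   # kernel length scale (assumed value)
NOISE = 1e-2         # homoscedastic noise variance (assumed value)


def matern52(A, B):
    """Matern kernel with nu = 5/2 between the rows of A and the rows of B."""
    d = np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)) / LENGTH_SCALE
    s = np.sqrt(5.0) * d
    return VARIANCE * (1.0 + s + s**2 / 3.0) * np.exp(-s)


def gp_posterior(X_train, y_train, X_query):
    """Posterior mean and standard deviation of the surrogate at X_query."""
    y_mean = y_train.mean()
    K = matern52(X_train, X_train) + NOISE * np.eye(len(X_train))
    K_s = matern52(X_train, X_query)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train - y_mean))
    mean = K_s.T @ alpha + y_mean
    v = np.linalg.solve(L, K_s)
    var = np.clip(VARIANCE - (v**2).sum(axis=0), 0.0, None)
    return mean, np.sqrt(var)


def ucb(mean, std, beta):
    """Upper confidence bound, equation (2): mu(x) + beta * sigma(x)."""
    return mean + beta * std


rng = np.random.default_rng(0)
X_train = rng.uniform(0.0, 1.0, size=(20, 3))             # toy formulations
y_train = np.sin(3.0 * X_train[:, 0]) + X_train[:, 1]     # toy response
X_cand = rng.uniform(0.0, 1.0, size=(500, 3))             # candidate points

mean, std = gp_posterior(X_train, y_train, X_cand)

# One beta per "market": small values favour the predicted mean,
# large values favour uncertain regions.
betas = rng.exponential(scale=1.0, size=8)
picks = [int(np.argmax(ucb(mean, std, b))) for b in betas]

if __name__ == "__main__":
    for b, i in zip(betas, picks):
        print(f"beta = {b:.2f} -> candidate {i}")
```

In scikit-learn, a comparable composite kernel could be built from the `ConstantKernel`, `Matern` and `WhiteKernel` classes; whether the authors composed them this way is not stated here.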


The portfolio of acquisition functions for different values of $\beta$,
which we call markets, was used to generate a batch. This 'capitalist'
approach has the advantage of simple parallelization and is robust
across variable batch sizes. Our method allowed us to constrain
the sum of all liquid components to 5 ml to allow a constant gas
headspace volume for gas chromatograph analysis. The sum total
volume constraint was handled during the market searches;
discretization, which was determined by instrument resolution, was
handled after the market searches. The market search was completed
using a large initial random sampling followed by a batch of seeded
local maximizations using a sequential least-squares programming
(SLSQP) algorithm as implemented in the scipy.optimize package.
This maximization occurs in a continuous space, and the results
are placed into discrete bins following the experimental precision.
The explored space is tracked as a continuous variable for model
building and as a discrete variable for acquisition function
maximization. The algorithm was implemented using the scikit-learn and
scipy packages.
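
The market search described above can be sketched as follows: random sampling on the volume constraint, seeded local maximization with SLSQP under an equality constraint, then rounding into dispensing bins. This is an illustrative outline only. The quadratic `acquisition` function is a hypothetical stand-in for a real upper confidence bound, the four-component problem and sample counts are arbitrary, and the largest-remainder rounding in `discretize` is one possible way to keep the total at 5 ml, not necessarily the authors' method.

```python
import numpy as np
from scipy.optimize import minimize

TOTAL_ML = 5.0    # constant liquid volume (from the text)
STEP_ML = 0.25    # liquid dispensing resolution (from the text)
TARGET = np.array([2.0, 1.5, 1.0, 0.5])   # hypothetical optimum, sums to 5 ml


def acquisition(x):
    """Hypothetical smooth stand-in for the acquisition function of one market."""
    return -np.sum((x - TARGET) ** 2)


def market_search(n_components=4, n_random=200, n_seeds=5, seed=0):
    """Random sampling on the constraint, then seeded SLSQP maximizations."""
    rng = np.random.default_rng(seed)
    # Dirichlet samples sum to 1, so scaling gives points with total volume 5 ml.
    samples = rng.dirichlet(np.ones(n_components), size=n_random) * TOTAL_ML
    scores = np.array([acquisition(s) for s in samples])
    seeds = samples[np.argsort(scores)[-n_seeds:]]

    constraint = {"type": "eq", "fun": lambda x: np.sum(x) - TOTAL_ML}
    bounds = [(0.0, TOTAL_ML)] * n_components
    best_x, best_val = None, -np.inf
    for x0 in seeds:
        # SLSQP minimizes, so the acquisition function is negated.
        res = minimize(lambda x: -acquisition(x), x0, method="SLSQP",
                       bounds=bounds, constraints=[constraint])
        if -res.fun > best_val:
            best_x, best_val = res.x, -res.fun
    return best_x


def discretize(x):
    """Round to dispensing steps while keeping the total volume fixed."""
    steps = x / STEP_ML
    whole = np.floor(steps + 1e-9).astype(int)
    remaining = max(int(round(TOTAL_ML / STEP_ML)) - int(whole.sum()), 0)
    order = np.argsort(-(steps - whole))   # largest remainders first
    whole[order[:remaining]] += 1
    return whole * STEP_ML


x_cont = market_search()
x_disc = discretize(x_cont)

if __name__ == "__main__":
    print("continuous:", np.round(x_cont, 3))
    print("discrete:  ", x_disc)
```

Handling the sum constraint inside the optimizer and the discretization afterwards mirrors the ordering described in the text.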


Materials and synthetic procedures 

The polymeric photocatalyst P10 was synthesized and purified according
to a modification of a literature procedure (section 10 of the
Supplementary Information). For solid dispensing, the polymer was ground
with a mortar and pestle before use. Sodium disilicate was obtained as a
free sample from Silmaco. Tap water was purified with a PURELAB Ultra
System. All other materials were purchased from Sigma-Aldrich and
used as received.


Data availability 


The implementation of the liquid-dispensing station, photolysis station
and the workflow, along with three-dimensional designs for labware
developed in the project, are available at
https://bitbucket.org/ben_burger/kuka_workflow; the code for the robot and
the Bayesian optimizer is available at
https://github.com/Taurnist/kuka_workflow_tantalus and
https://github.com/CooperComputationalCaucus/kuka_optimizer.
Additional design details can be obtained from the
authors upon request.

Extended Data Fig. 1 | Mobile robotic chemist. The mobile robot used for this
project, shown here performing a six-point calibration with respect to the black
location cube that is attached to the bench, in this case associated with the
solid cartridge station (see also Supplementary Fig. 11 and Extended Data
Fig. 3a).

Extended Data Fig. 2 | Laboratory space used for the autonomous experiments. The key locations in the workflow are labelled. Other than the black location
cubes that are fixed to the benches to allow positioning (see also Extended Data Fig. 1), the laboratory is otherwise unmodified.

Extended Data Fig. 3 | Stations in the workflow. a, Photograph showing the
robot at the solid dispensing / cartridge station. The two cartridge hotels can
hold up to 20 different solids; here, four cartridges are located in the hotel on
the left. The door of the Quantos dispenser is opened using custom workflow
software that interfaces with the command software that is supplied with the
instrument before loading the correct solid dispensing cartridge into the
instrument (Supplementary Video 3). Since the KUKA Mobile Robot is
free-roaming and has an 820 mm reach, it would be simple to extend this
modular approach to hundreds or even thousands of different solids given
sufficient laboratory space. b, Photograph showing the KUKA Mobile Robot at
the photolysis station (see also Supplementary Videos 3, 6). c, Photograph
showing the KUKA Mobile Robot at the combined liquid handling/capping
station. The robot can reach both the liquid stations and the Liverpool
Inertization Capper-Crimper (LICC) station after six-point positioning, such
that liquid addition, headspace inertization and capping can be carried out in a
single coordinated process (see Supplementary Videos 3, 5), without any
position recalibration. d, Photograph of the KUKA Mobile Robot parked at the
headspace gas chromatography (GC) station. The gas chromatography
instrument is a standard commercial instrument and was unmodified in this
workflow.

Extended Data Fig. 4 | Hydrogen evolution rates for candidate bioderived
sacrificial hole scavengers. Results of a robotic screen for sacrificial hole
scavengers using the mobile robot workflow. Of the 30 bioderived molecules
trialed, only cysteine was found to compete with the petrochemical amine,
triethanolamine. Scavengers are labelled with the concentration of the stock
solution that was used (5 ml volume, 5 mg P10). The error bars show the
standard deviation.

Extended Data Fig. 5 | Multipurpose gripper used in the workflow. The
gripper is shown grasping various objects. a, The empty gripper; b, gripper
holding a capped sample vial (top grasp); c, gripper holding an uncapped
sample vial (side grasp); d, gripper holding a solid-dispensing cartridge; and
e, gripper holding a full sample rack using an outwards grasp that locks into
recesses in the rack. The same gripper was also used to activate the gas
chromatography instrument using a physical button press (see Supplementary
Video 3; 1 min 52 s).


[Extended Data Fig. 6 diagram labels: Full workflow, 481 minutes. Sample preparation, 184 minutes. Solid dispensing, 35 minutes. Liquid dispensing and capping, 59 minutes. Photolysis, 88 minutes. GC analysis, 232 minutes.]

Extended Data Fig. 6 | Timescales for steps in the workflow. Average
timescales for the various steps in the workflow (sample preparation,
photolysis and analysis) for a batch of 16 experiments. These averages were
calculated over 46 separate batches. These average times include the time
taken for the loading and unloading steps (for example, the photolysis time
itself was 60 min; loading and unloading takes an average of 28 min per batch).
The slowest step in the workflow is the gas chromatography analysis.

Enhanced silicate rock weathering (ERW), deployable with croplands, has potential
use for atmospheric carbon dioxide (CO2) removal (CDR), which is now necessary to
mitigate anthropogenic climate change. ERW also has possible co-benefits for
improved food and soil security, and reduced ocean acidification. Here we use an
integrated performance modelling approach to make an initial techno-economic
assessment for 2050, quantifying how CDR potential and costs vary among nations in
relation to business-as-usual energy policies and policies consistent with limiting
future warming to 2 degrees Celsius. China, India, the USA and Brazil have great
potential to help achieve average global CDR goals of 0.5 to 2 gigatonnes of carbon
dioxide (CO2) per year with extraction costs of approximately US$80-180 per tonne
of CO2. These goals and costs are robust, regardless of future energy policies.
Deployment within existing croplands offers opportunities to align agriculture and
climate policy. However, success will depend upon overcoming political and social
inertia to develop regulatory and incentive frameworks. We discuss the challenges
and opportunities of ERW deployment, including the potential for excess industrial
silicate materials (basalt mine overburden, concrete, and iron and steel slag) to
obviate the need for new mining, as well as uncertainties in soil weathering rates and
land-ocean transfer of weathered products.


The failure of the world to curb fossil fuel CO2 emissions, and the
inadequacy of planned mitigation measures, have been greeted with
growing public consternation consistent with the intergenerational
injustice of human-caused climate change. Even the most ambitious
emission phase-outs fail to achieve the United Nations Framework
Convention on Climate Change Paris Agreement targets for limiting
global warming without the help of massive amounts of atmospheric
CDR. Extraction goals later this century in most studies are on the
order of at least 10 gigatonnes of CO2 per year (Gt CO2 yr⁻¹), although
projections of rapid technological change suggest a lower requirement
of 2-2.5 Gt CO2 yr⁻¹. This formidable challenge has led to international
calls for urgent research into a portfolio of CDR options to understand
their feasibility, scope, costs and challenges.

Our focus is terrestrial ERW, a CDR strategy based on amending soils
with crushed calcium- and magnesium-rich silicate rocks to accelerate
CO2 sequestration. Basalt, an abundant fast-weathering
rock with the required mineral chemistry, could be ideal for implementing
land-based ERW because of its potential co-benefits for crop
production and soil health. ERW liberates base cations, generating
alkalinity, so that atmospheric CO2 is converted into dissolved inorganic
carbon (principally hydrogen carbonate ions; HCO3⁻) that is removed
via soil drainage waters. These weathering products are transported
via land surface runoff to the oceans with a storage lifetime exceeding
100,000 years. Depending on soil type and pH, atmospheric
CO2-derived dissolved inorganic carbon may also be sequestered
through the formation of soil carbonate minerals, which reduces the
efficiency of carbon sequestration by approximately half. The logistical
infrastructure to apply basaltic rock dust to managed croplands
already exists owing to the common need to apply crushed limestone
to reverse soil acidification resulting from intensive cropping.
Thus, rapid deployment at large scale appears to be feasible within
decades, and has important ancillary benefits including mitigation of
ocean acidification. Carbon sequestration by ERW on croplands
(a biogeochemical CDR option supporting multiple United Nations
sustainable development goals and ecosystem services, and a pragmatic
land-use choice to maximize scalability and co-benefits) thus
warrants detailed examination.

We constructed a performance model with sub-national level of
detail to assess quantitatively the CDR capacity and costs for land-based
ERW implementation in major economies, constrained by available
agricultural land area and energy production (including USA, India,
China, Brazil and Europe) (Extended Data Fig. 1). For rock weathering
within the soil profile, we developed a one-dimensional vertical reactive
transport model with steady-state flow, and a source term representing
rock grain dissolution (Methods; Supplementary Figs. 1-12;
Supplementary Tables 1-5). Our work builds on advances made in
prior ERW research largely on tropical forested ecosystems,
with the practical aims of understanding the capacity of agriculture to
capture carbon via soil amendment with milled basalt. For this initial
nation-by-nation assessment, we examine the sensitivity of net CDR
on current croplands to the projected national energy production for
2050 under a business-as-usual (BAU) energy scenario based on ongoing
energy transitions. This is compared with a 2 °C scenario (that is,
a scenario in which the increase in global mean temperature since the
pre-industrial period is limited to 2 °C by 2050), which includes a wide
range of policy measures designed to respect the 2 °C target with 75%
probability (Supplementary Tables 6-12).


CDR potential via ERW 


Our geospatial analyses define a new technical potential CDR range
for those nations with high capacity for ERW deployment on cropland
(Fig. 1; Supplementary Figs. 13-15). For each nation, we generate CO2
capture curves by ranking CDR potential from the highest to the lowest
grid cells with increasing ERW deployment. National median CDR
curves typically show CDR capacity rising with increasing cropland area,


Fig. 1 | CDR via ERW with croplands. Net CDR curves for nations with the
highest CDR potential worldwide (a-g) and in Europe (h-l) as a function of
increasing ERW deployment across existing croplands. Note the y-axis scale
changes. Results are shown for the BAU and the 2 °C energy policy scenarios.
[...] represents the 90% confidence interval calculated for basalts with
relatively slow- versus fast-weathering rates for the BAU scenario; short
green dashed lines indicate the 90% confidence limits of the corresponding
2 °C scenario simulations. Uncertainty in net CDR increases as ERW deploys
onto croplands occupying a wider range of environmental conditions.

with CDR by silicate soil amendment reaching a plateau, or declining
in the case of Canada (Fig. 1). These patterns reflect expansion of ERW
into climatically unfavourable agricultural land, causing CDR potential
to slow relative to the carbon penalty of logistical operations, and our
assumed 3% limit in national energy available for grinding (see Methods
section ‘Cost assessments’ and Extended Data Fig. 2). Overall trends in
national CDR curves are relatively insensitive to the choice of energy
scenario. China is the exception because its large increase in low-carbon
energy usage projected under the 2 °C scenario allows net CDR to rise
by substantially reducing secondary CO₂ emissions from logistical
operations (Fig. 1). This contrasts with results for India, whose total
energy production falls by around 40% with a transition to low-carbon
energy production in the 2 °C scenario, lowering the energy available
for grinding basalt, and thus the potential for increased CDR by ERW.
Reductions in energy production for other nations in the 2 °C scenario
compared with the BAU scenario similarly lower their potential for
increased CDR with the transition to low-carbon energy.
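
The ranking procedure behind the national capture curves can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `capture_curve`, the (area, net CDR) tuple layout and the example values are all assumptions introduced here.

```python
from itertools import accumulate


def capture_curve(cells):
    """Build a cumulative net-CDR curve from grid cells.

    cells: list of (cropland_area, net_cdr) tuples, one per grid cell
           (hypothetical layout; units only need to be consistent).
    Returns a list of (cumulative_area_fraction, cumulative_net_cdr).
    """
    # Rank cells from the highest to the lowest CDR per unit area, so that
    # deployment expands onto progressively less favourable land.
    ranked = sorted(cells, key=lambda cell: cell[1] / cell[0], reverse=True)
    total_area = sum(area for area, _ in ranked)
    cum_area = accumulate(area for area, _ in ranked)
    cum_cdr = accumulate(cdr for _, cdr in ranked)
    return [(a / total_area, c) for a, c in zip(cum_area, cum_cdr)]


# Illustrative values only: the last-ranked cell has a negative net CDR,
# i.e. logistical emissions exceed capture, so the curve declines at the end.
example_cells = [(100.0, 50.0), (200.0, 300.0), (100.0, -20.0), (100.0, 10.0)]
curve = capture_curve(example_cells)
```

A curve that rises, flattens and then falls in this way mirrors the plateau or decline described for nations where ERW extends onto climatically unfavourable land.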
Recognizing the urgent need to assess large-scale options for meeting
near-term CDR goals, we determine the potential contribution of
nations to achieve CDR goals across the 0.5-2 Gt CO₂ yr⁻¹ range (Table 1;
Extended Data Fig. 3). Overall, we find that the three countries with
the highest CDR potential are coincidentally the highest fossil fuel
CO₂ emitters (China, USA and India) (Fig. 1). Indonesia and Brazil,
with CO₂ emissions 10-20 times lower than the USA and China, have
relatively high CDR potential owing to their extensive agricultural
lands and warm, seasonally wet climates conducive to high silicate
rock weathering efficiency. European countries have a CDR potential
an order of magnitude lower than those of China, USA and India, mainly
because of less agricultural land area. The five European nations with
the highest net CDR potential could offset 30% of the current emissions
of European nations in 2019 and the three European countries
with the highest CDR potential are also the largest European emitters
of CO₂ from fossil fuels (Germany, Spain and Poland). Our ERW
scenarios (Table 1) correspond to an aggregate CDR of 25-100 Gt CO₂


Nature | Vol 583 | 9 July 2020 | 243 


Article 


if sustained over five decades (0.5-2 Gt CO₂ yr⁻¹ for 50 years). This
would save up to 12% of the remaining cumulative carbon emission budget
(about 800 Gt CO₂) that gives a 66% probability of limiting global
warming to below 2 °C above the pre-industrial average surface
temperature.

In the context of our CDR goals, ERW has a potential similar to that of
other CDR strategies estimated for 2050, including bio-energy with
carbon capture and storage (BECCS), widely adopted in IPCC future
scenarios (0.5-5 Gt CO₂ yr⁻¹), direct air capture and storage (DAC)
(0.5-5 Gt CO₂ yr⁻¹), biochar (0.5-2 Gt CO₂ yr⁻¹), soil organic carbon
sequestration (0.5-5 Gt CO₂ yr⁻¹), and afforestation/reforestation
(0.5-3.6 Gt CO₂ yr⁻¹). One benefit of country-level analysis for CDR is
the scope for comparative assessments with other technologies and
opportunities for co-deployment. For example, our ERW CDR range is
comparable with large-scale implementation of BECCS in the USA by
2040 (0.3-0.6 Gt CO₂ yr⁻¹), as constrained by biomass productivity,
location and capacity of CO₂ storage sites. ERW avoids competition
for land used in food production, and related increased demands of
BECCS for freshwater and polluting fertilizers, with CO₂ being treated
as a resource for mineral weathering. Co-deployment of ERW with
feedstock crops for BECCS and biochar could enhance the feasibility
and carbon sequestration potential of these strategies.

Inorganic carbon sequestration by ERW appears to be comparable
to soil organic carbon sequestration, another proposed CDR strategy
(about 2.5 Gt CO₂ yr⁻¹ by 2100) using agricultural land, but with
potentially greater long-term security of carbon storage. Co-deployment of
ERW and soil organic carbon sequestration at large scale might,
therefore, contribute substantially to the 5 Gt CO₂ yr⁻¹ CDR goal suggested
in decarbonization scenarios for 2050. Compatibility of ERW and soil
organic carbon sequestration may be realistic given that amendment
of acidic organic-rich soils with silicate minerals, and the resultant
pH increase, had no effect on respiratory CO₂ fluxes, contrary to
concerns that increased soil pH may accelerate organic matter
decomposition. However, efficacy of CDR, sink saturation, and permanency
of storage with these approaches, separately and interactively, are
uncertain. Abatement of soil N₂O emissions by basalt application
to conventionally managed arable and perennial crops, and of N₂O
and CH₄ emissions by application of artificial silicates to rice
agriculture, is possible. Such effects would further lower adverse impacts of
agriculture on climate per unit yield, amplifying the climate mitigation
potential of ERW.

Greenhouse gas emissions reductions aimed at limiting future warming
are defined under the Paris Agreement by Nationally Determined
Contributions (NDCs). As yet, most of the top ten fossil carbon emitting
nations are failing to meet their 2030 NDC pledges which, even if
met, imply a median global warming (2.6-3.1 °C) exceeding the Paris
agreement. Warming of this magnitude could allow the Earth system
to cross thresholds for irreversible planetary heating and long-term
multi-metre sea-level rise, with potentially disastrous consequences
for coastal cities. NDC pledged carbon emission reductions undergo
periodic revision in response to trends in greenhouse gas emissions,
uptake of low-carbon energy technology, and climate and hence are
not set for 2050. We therefore illustrate the potential for undertaking
ERW with agricultural lands to strengthen near-term national 2030
NDCs (Fig. 2).

Results show that China may be able to augment its pledged 2030
NDCs by about 5% to 10%, with similar gains for the USA, which has
opted out of the Paris agreement. For India, the gain rises to 40% of
its current pledged emissions, and Brazil may be able to offset 100%
of its pledged 2030 CO₂ emissions plus some fraction of those from
other countries (Fig. 2). Other countries outside Europe considered
in our analysis (Indonesia, Canada, Mexico) may be able to augment
their NDCs by up to 30% (Fig. 2). In Europe, ERW could aid substantial
decarbonization of France and Spain (up to approximately 40%), and to
a lesser extent Poland, Italy and Germany (all about 10%) (Fig. 2). ERW,
therefore, may have a role to play in compensating for residual carbon
emissions from sectors recognized as being difficult to decarbonize,
for example, transportation by aviation, shipping and agriculture.


Table 1 | CDR goals for ERW with croplands in 2050

Region  Country    Cropland     National CDR   Silicate demand  Cost
                   area (%)     (Gt CO₂ yr⁻¹)  (Gt yr⁻¹)        (US$ per t CO₂)

0.5 Gt CO₂ yr⁻¹
World   China      10           0.13           0.77             102.1
        USA        [illegible]  [illegible]    [illegible]      [illegible]
        India      [illegible]  [illegible]    [illegible]      [illegible]
        Brazil     [illegible]  [illegible]    [illegible]      [illegible]
        Indonesia  10           0.017          0.091            54.3
        Canada     10           0.022          0.13             177.6
        Mexico     10           0.013          0.073            97.5
Europe  France     10           0.017          0.085            158.1
        Germany    11           0.012          0.066            167.8
        Italy      11           0.0070         0.039            181.9
        Spain      10           0.012          0.066            192.8
        Poland     [illegible]  [illegible]    [illegible]      [illegible]

1.0 Gt CO₂ yr⁻¹
World   China      [illegible]  [illegible]    [illegible]      [illegible]
        USA        24           0.21           1.26             168.5
        India      23           0.24           1.50             79.9
        Brazil     23           0.083          0.45             116.4
        Indonesia  25           0.033          0.18             57.5
        Canada     16           0.030          0.20             191.7
        Mexico     33           0.025          0.15             103.1
Europe  France     [illegible]  0.034          0.17             160.4
        Germany    [illegible]  [illegible]    [illegible]      [illegible]
        Italy      [illegible]  [illegible]    [illegible]      [illegible]
        Spain      [illegible]  [illegible]    [illegible]      [illegible]
        Poland     7            0.012          0.081            170.9

1.5 Gt CO₂ yr⁻¹
World   China      38           0.40           2.48             114.5
        USA        39           0.32           1.99             173.
        India      36           0.37           2.35             80.2
        Brazil     36           0.13           [illegible]      [illegible]
        Indonesia  [illegible]  0.050          0.28             58.6
        Canada     25           0.045          0.35             207.3
        Mexico     [illegible]  0.038          [illegible]      [illegible]
Europe  France     38           0.050          0.26             159.5
        Germany    39           0.037          0.20             173.6
        Italy      37           0.021          0.13             194.1
        Spain      28           0.026          0.17             189.3
        Poland     27           0.019          0.13             171.3

2.0 Gt CO₂ yr⁻¹
World   China      55           0.53           3.46             120.7
        USA        55           0.42           2.72             167
        India      51           0.49           3.30             80.9
        Brazil     [illegible]  [illegible]    [illegible]      [illegible]
        Indonesia  59           0.067          0.38             59.4
        Canada     35           0.060          0.51             220.3
        Mexico     52           0.050          0.33             106.8
Europe  France     54           0.067          0.36             157.1
        Germany    57           0.050          0.28             175.9
        Italy      55           0.029          0.18             193.3
        Spain      [illegible]  0.035          0.25             190.7
        Poland     38           0.025          0.19             [illegible]

All values are means of both energy scenarios; see main text for details.
For each country, c, we assigned its contribution, CDR(c), to a global CDR
goal as follows, where CDR_max(c) is the maximum CDR value attainable by a
country:

$$\mathrm{CDR}(c) = \mathrm{CDR}_{\mathrm{goal}} \times \frac{\mathrm{CDR}_{\max}(c)}{\sum_{\mathrm{countries}} \mathrm{CDR}_{\max}}$$
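
The allocation rule given with Table 1, in which each country receives a share of the global goal in proportion to its maximum attainable CDR, can be illustrated with a short Python sketch. The function name `allocate_cdr_goal` and the `cdr_max` values below are hypothetical placeholders, not figures from the analysis.

```python
def allocate_cdr_goal(goal, cdr_max):
    """Split a global CDR goal among countries in proportion to CDR_max(c).

    goal:    global CDR goal (for example in Gt CO2 per year).
    cdr_max: dict mapping country -> maximum attainable CDR (same units).
    """
    total_max = sum(cdr_max.values())
    return {country: goal * value / total_max for country, value in cdr_max.items()}


# Hypothetical maximum CDR values, for illustration only.
cdr_max_example = {"A": 1.0, "B": 0.6, "C": 0.4}
shares = allocate_cdr_goal(1.0, cdr_max_example)
# Shares sum to the goal, and each share stays below the country's maximum
# whenever the goal does not exceed the summed maxima.
```

Under this rule, doubling the global goal doubles every national contribution, which is consistent with the way the legible national CDR values in Table 1 scale across the four goals.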


Costs of CDR via ERW 


Cost assessment is needed to evaluate commercial feasibility of ERW
and to put a price on climate mitigation actions (Extended Data Fig. 4).
Our cost estimates based on current prices (2019 US dollars, throughout)
fall within the range of prior ERW assessments (US$75-250 per
tonne of CO₂) while resolving differences among nations (Fig. 3;
Table 1; Supplementary Figs. 16-25; Supplementary Tables 13 and 14).
Average costs in the USA (US$160-180 per tonne of CO₂), Canada and
European nations (US$160-190 per tonne of CO₂) are almost 50% higher
than those in China, India, Mexico, Indonesia and Brazil (US$55-120
per tonne of CO₂). The difference largely reflects labour, diesel and
electricity costs.

Defined as the cost of CDR and storage, the price of carbon is a
proposed economic enabler for bringing CDR strategies to market.
Carbon price is forecast by the World Bank to reach US$100-150 per
tonne of CO₂ by 2050. Costs per tonne of CO₂ removed by ERW are
generally within this projected carbon price range in all nations,
but unit costs increase when cropland area exceeds the optimal
fraction, because the efficiency of weathering and CDR falls (Fig. 3;
Table 1). A carbon price of US$100-150 per tonne of CO₂ would cover
most of the ERW costs for the key nations reported here. It would
make ERW an economically attractive option for fast-growing
nations, such as India, China, Indonesia, Brazil and Mexico, given their
estimated CO₂ extraction costs of around US$75-100 per tonne of CO₂
(Fig. 3; Table 1).

Our estimated ERW costs of CDR for nations are comparable to estimates
summarized for BECCS (US$100-200 per tonne of CO₂), direct
air capture and storage (US$100-300 per tonne of CO₂), and biochar
(US$30-120 per tonne of CO₂), but higher than estimates for soil


Fig. 2 | Augmentation of pledged CO₂ emissions reduction by ERW. Fraction
of 2030 NDC emissions reductions by enhanced weathering for nations with
the highest CDR potential worldwide (a-g) and in Europe (h-l), as a
function of increasing ERW deployment across croplands. Note the y-axis
scale changes. Results are shown for the BAU energy policy and the 2 °C
energy policy scenarios. The grey-shaded area for each nation represents
the 90% confidence interval calculated for basalts with relatively slow-
versus fast-weathering rates for the BAU scenario; short green dashed
lines indicate the 90% confidence limits of the corresponding 2 °C
scenario simulations.
organic carbon sequestration (US$0-10 per tonne of CO₂). Afforestation/
reforestation and practices that increase soil carbon in natural
ecosystems, including wetland restoration, have lower estimated costs
(<US$100 per tonne of CO₂). However, these natural carbon sequestration
options require assessment for possible indirect unintended
positive climate feedbacks.

Per capita metrics help to conceptualize the matter of costs in terms
relevant to citizens. Current fossil fuel emissions per person per year
are 16.5 t CO₂ (USA), 15.1 t CO₂ (Canada), 7.5 t CO₂ (China), 7.3 t CO₂
(Europe), 2.6 t CO₂ (Brazil), 1.8 t CO₂ (Indonesia) and 1.7 t CO₂ (India).
ERW cannot offset all fossil fuel emissions, but using its cost as a guide,
the per capita annual cost of achieving zero net emissions, a goal for
decarbonization, would be highest for Canada (US$3,004), the USA
(US$2,780), China (US$832) and Europe (US$1,288). Costs fall
substantially for citizens in Brazil (US$300), Indonesia (US$103) and India
(US$135) (Table 1).
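
As a worked illustration of the per capita metric, the annual cost per person can be taken as per capita emissions multiplied by the unit cost of removal. The sketch below is an assumption about how such figures can be reproduced, not a statement of the authors' exact calculation: the function name is hypothetical, and the unit cost used for the USA (US$168.5 per tonne of CO₂) is the Table 1 value for the 1.0 Gt CO₂ yr⁻¹ goal, chosen here because it recovers the quoted US$2,780.

```python
def per_capita_annual_cost(emissions_t_per_person, cost_usd_per_t):
    """Annual cost (US$) per person of removing all per capita emissions.

    emissions_t_per_person: fossil fuel emissions, t CO2 per person per year.
    cost_usd_per_t:         assumed ERW unit cost, US$ per t CO2.
    """
    return emissions_t_per_person * cost_usd_per_t


# USA: 16.5 t CO2 per person per year at an assumed US$168.5 per t CO2.
usa_cost = per_capita_annual_cost(16.5, 168.5)  # 2780.25
```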

At this early stage of research and development, costs are uncertain
and ERW is in need of demonstration projects. Costs are likely to
decline as the market expands and technologies develop. This includes
emergence of more energy-efficient, low-carbon technologies for
rock grinding. Costs may also decline via co-deployment with
afforestation/reforestation projects or agroforestry as part of worldwide
carbon-offset trading schemes. The net cost of ERW may be lower,
given that rock dust is an acceptable fertilizer for organic agriculture,
which currently occupies 57.8 million hectares, because it adds
economic value by improving soil health, fertility and ecosystem services.


Implementation challenges and opportunities 

Our analysis of the techno-economic potential for CDR via ERW 
strengthens the case for evaluating all aspects of practical deploy- 
ment in developed and developing economies. This includes: meeting 
rock demand through alternative sources that avoid mining expansion; 
Fig. 3 | Costs of carbon extraction via ERW with croplands. Costs of CO₂
extraction from air by ERW for nations with the highest CDR potential
worldwide (a-g) and in Europe (h-l), as a function of increasing ERW
deployment across croplands. Results are shown for the BAU and the 2 °C


undertaking a more complete economic valuation; and engaging the 
public to understand social acceptance. 

National demand for crushed silicate rock is dependent on the extent
of ERW deployment (Extended Data Fig. 5). Within our scenarios, the
demand for basalt required for ERW rises with an increasing CDR goal
and scales with agricultural land area (Table 1). Safeguarding against
substantial increased mining and possible adverse impacts on inhabitants
requires exploiting underutilized stockpiles of crushed basalt
produced as a by-product of the aggregate industry. Mining generates
a continuous, but often discarded, finely powdered silicate by-product
that is utilizable for ERW. These materials have been accumulating
worldwide for decades and may require no further grinding, thus lowering
CO₂ emissions that reduce CDR efficiency (Extended Data Fig. 6).
However, national inventories of the location, availability and extent
of this resource are required to assess the potential contribution of
this resource to CDR via ERW.

Requirement for mining may be further reduced by using artificial
silicate by-products from industrial processes, including calcium-rich
silicates produced by iron and steel manufacturing (slags) with a long
history of agricultural usage. This material is recycled as low-value
aggregate (less than about US$5 per tonne), and often stockpiled at
production sites or disposed of in landfills, whereas it could become
a valuable commodity for CDR. The largest amounts of by-products
from the construction and demolition industry are cement, sand and
masonry. Following separation from other materials (for example,
metals and plastics), the cement comprises relatively ‘clean’ calcium-rich
silicates and may be suitable for application to soils, but this suggestion
requires field trials to assess suitability. Cement contributes about 6%




energy policy scenarios. The grey-shaded area for each nation represents the 
90% confidence interval calculated for basalts with relatively slow- versus 
fast-weathering rates for the BAU scenario; short green dashed lines indicate 
the 90% confidence limits of the corresponding 2 °C scenario simulations. 


to global CO₂ emissions and ERW may represent a land management
option for valuing the by-products of cement and improving the
sustainability of this worldwide industry.

We forecast production of artificial calcium-rich cements for
construction and by-product slag from steel manufacturing for Brazil,
China, India and the USA, to understand their potential role in meeting
silicate demand for ERW (Fig. 4). Differences between national
production estimates are driven by forecast population increases over the
coming century, and per capita consumption trends for the material
under the middle-of-the-road Shared Socioeconomic Pathway (see
Methods). Bulk silicate production from the construction and
demolition sector is modelled to increase substantially in all four nations,
with China and India having a combined production by 2060 of about
13 Gt yr⁻¹ (Fig. 4). China and India dominate, with above-average per
capita cement use compared to the global average, and substantially
larger populations than the USA and Brazil. Thus, bulk silicate
production of these two nations could meet the demand for ERW with large
CDR potential (Table 1). Although chemically similar to basalt, these
artificial calcium-rich silicates contain minerals that dissolve several
orders of magnitude faster, react rapidly with CO₂ in soils under
ambient conditions, and are produced in fine particle sizes that facilitate
accelerated weathering.

Agricultural production could benefit substantially from increased
resource use efficiency, reducing consumption of raw materials and
recovering mineral nutrients from silicate by-products and from
legacy reserves of silicate rock dust. However, application of any
silicate material to agricultural soils requires careful assessment of
the risks, including potential release of metals and persistent organic
compounds (Supplementary Table 15). Undertaking ERW practices
with these materials addresses a critical need to fertilize soils with
silica and other nutrients lost by harvesting that gradually depletes
plant-available pools. Intensification of food production across
24 million hectares of productive agricultural land in South Asia and
China, for example, is creating acidified, desilicated soils exhausted
in plant nutrients (potassium, zinc and available phosphorus) that
limit yields. Yet these negative effects may be reversible with ERW
treatments such as fertilization of irrigated rice using natural
and/or artificial silicates (for example, recycled steel slags). Such
treatments replenish plant-available silica pools, increasing yields and soil
pH, and decrease the mobility of potentially toxic trace elements (for
example, arsenic). ERW may therefore also have a role in remediation
of toxic-metal-contaminated soils and sediments across 20 million
hectares of cultivated land in southern China and elsewhere.

More broadly, innovative ERW practices via soil amendments with
targeted silicate minerals could help to rebuild rapidly deteriorating
agricultural soils on which over six billion people depend directly for food.
Such practices may complement other approaches to soil improvement,
including conservation tillage and nitrogen-fixing cover crops. The
current substantial rate of agricultural top-soil depletion requires urgent
remedial action, with high economic costs apparent already in China,
where degradation of soils supporting wheat, maize and rice production
costs an estimated US$12 billion annually. Targeted amendment of
agricultural soils for CDR may have a role in slowing rates of soil loss by
up to 45%, with the accelerated weathering of added minerals replacing
inorganic nutrients and the resultant formation of clays and mineral
organic aggregates increasing the cation exchange capacity and water
storage capacity of rebuilt soils. The addition of trace amounts of
zinc and iron could also improve public health by reversing the effect
of rising CO₂ levels on the declining nutritional value of food crops.

The feasibility of mobilizing millions of smallholder communities 
to adopt ERW practices in China and India will depend both on dem- 
onstrating that soil improvements can reverse yield declines and on 
government subsidies. Farming practices adopted for increasing sus- 
tainable productivity, for example, have transformed agriculture across 
37 million hectares in China, increasing profits by US$12.2 billion over 
a decade. With 2.5 billion smallholders farming 60% of the world’s
arable land, a similar outreach programme could be used throughout
Asia, with farmers earning more profits from higher yields while
sequestering CO₂. Involving local scientists in conducting research
into its effectiveness and safety to build trust and engagement with 
smallholder farmers is key, alongside involvement with policymakers 
and stakeholders. This increases the potential to bring smallholders 
out of extreme poverty and, in the regions with climates suitable for 
non-irrigated agriculture, restore highly degraded soils not suitable 
currently for food production. 

Realizing the potential of ERW as a biogeochemical approach to 
sequester CO, by altering land management practices will depend on 
the commitment of farmers and governments, implementation of the 
right policy frameworks and wider public acceptance. Understand- 
ing the balance between positive and negative outcomes, in terms of 
public acceptance of the inevitable trade-offs between local mining 
activities versus global sequestered carbon, requires empirical test- 
ing with stakeholders and the wider public. Crucially, such testing 
needs to understand the conditions that society might place upon 
the development and large-scale deployment of ERW technologies, 
as part of a wider responsible research and innovation programme.


Uncertainties 


Our analysis of the techno-economic potential of CDR by ERW is subject 
to several uncertainties, particularly variation in our baseline applica- 
tion rate and basalt mineralogy. It also identifies priority areas benefit- 
ting from more research into ERW under field conditions. 
Fig. 4 | Forecast increases in national bulk silicate production over the
next century. Simulated future increases in bulk artificial silicate
by-products (slag, cement, kiln dust and cementitious demolition waste)
production during the twenty-first century are given for China (a),
India (b), USA (c) and Brazil (d), based on the middle-of-the-road Shared
Socioeconomic Pathway (Methods).


Extrapolation of laboratory weathering rates to the field scale is a
recognized potential source of uncertainty in calculated CDR rates by
ERW. We addressed this by Monte Carlo analysis of the fractal
dimension accounting for uncertainty in the apparent reacting surface
area of grains for ERW conducted at large geographical scales.
Together with the chemical affinity effects accounted for in our model,
we constrain some of the systematic errors embedded in prior ERW
assessments.

Surface passivation, a component of chemical inhibition, occurs
as weathering proceeds, creating leached layers and relatively stable
secondary minerals, which potentially inhibit the mass transfer
kinetics of elements from the dissolving surfaces of primary minerals. The
current state of knowledge precludes a detailed treatment of the role
of surface passivation by formation of amorphous silica-rich surfaces
for basalt grains added to agricultural soils. ERW analysis will benefit
from future research to improve mechanistic insight and formulation
of kinetic equations.

It remains to be determined whether our theoretical analyses of 
the techno-economic potential for this CDR approach are consistent 
with findings from long-term field-scale ERW trials. Such trials are 
urgently required to assess weathering and CDR efficiency of freshly 
crushed rock grains with highly reactive surfaces added to agricultural 
soils subject to periodic wet-dry cycles during the growing season.
The potential for the trapping of weathered cations on ion exchange
surfaces or within secondary minerals other than carbonates, thus
delaying or preventing land-ocean transfer, will depend on soil type,
climate, hydrological conditions, application rate and management
practices. The duration of the carbon sequestration rate, and the
possibility of CO₂ sink saturation with ERW on croplands, are both poorly
constrained by data, as for other land-based CDR strategies. Other
areas for further research include: the quantification of biogeochemical 
transformations of carbon and nitrogen associated with organic and 
inorganic fertilization practices; atmospheric deposition; and the role 
of rhizosphere biology. 


Conclusions 


The techno-economic assessment of ERW’s potential to contribute 
large-scale CDR requires further integration of nation-by-nation 
quantitative analysis, together with large-scale pilot demonstrations 
supported by fundamental process studies and public engagement.
Our analysis identifies the engineering challenges if ERW were to be
scaled up to help meet ambitious CDR goals as part of a wider portfolio
of options. Estimated ERW costs are comparable to (and generally
lower than) current estimates for the intensive CDR technologies,
BECCS and direct air capture, and ERW has potential ancillary benefits
through limiting coastal zone acidification and improving food and
soil security. Nations that may have large ERW potential, including
China, the USA and India, are all vulnerable to climate change and
resultant sea-level rise. Their high risks of economic damage and
social disruption should provide the impetus for creative co-design
of agricultural and climate policies. Success requires incentives and
regulatory frameworks that overcome social and political inertia.
Silicate demand of nations must also be met in a way that facilitates social
acceptance and preservation of biodiversity.


Deployment of any CDR strategy is inhibited by the concern that it 
may erode society's perception of the climate threat and the urgency 
of mitigation measures. However, the ancillary benefits of ERW may
aid its early use by creating ‘demand pull’, and relieve such concern.
Innovative ‘climate-smart’ farming practices can be designed with ERW
to draw down CO₂ and other greenhouse gases while recycling
nutrients, aiding soil water storage, and supporting crop production.
Such practices can help to restore deteriorating topsoils that underpin
food security for billions of people while maximizing the societal
co-benefits needed to incentivize deployment. Financial, industrial
and policy road-mapping that links the possibility to reliably set and
achieve short-term and long-term goals is needed, including a broader
analysis of risks and co-benefits, to determine the part that ERW
might play in climate risk mitigation.
Methods


CDR simulation framework 

Our analysis is based on a one-dimensional vertical reactive transport
model for rock weathering with steady-state flow, and a source
term representing rock grain dissolution within the soil profile
(Supplementary Methods). The model accounts for changing dissolution
rates with soil depth and time as grains dissolve, chemical inhibition
of dissolution as pore fluids approach equilibrium with respect
to the reacting basaltic mineral phases, and the formation of pedogenic
calcium carbonate mineral in equilibrium with pore fluids. Simulations
consider basalts exhibiting relatively slow- versus fast-dissolution
rates due to differing mineralogy (Supplementary Tables 1-3). Basaltic
minerals undergo dissolution at different rates, with some minerals
continuing to undergo dissolution and to capture CO₂ after the first
year of application. Thus calculating representative annual CDR rates
requires computing average rates derived from repeated basaltic rock
dust applications (Extended Data Fig. 7).


Transport equation. The calculated state variable in the transport
equation is the dissolved molar equivalents of elements released by
stoichiometric dissolution of mineral $i$, in units of moles per litre.
$\phi$ is volumetric water content, $C_i$ is dissolved concentration
(moles per litre) of mineral $i$ transferred to solution, $t$ is time
(years), $q$ is vertical water flux (metres per year), $z$ is distance
along vertical flow path (metres), $R_i$ is the weathering rate of basalt
mineral $i$ (moles per litre of bulk soil per year) and $C_{\mathrm{eq},i}$
is the solution concentration of weathering product at equilibrium with
the mineral phase $i$ (equation (1)):

$$\phi \frac{\partial C_i}{\partial t} = -q \frac{\partial C_i}{\partial z} + R_i \left(1 - \frac{C_i}{C_{\mathrm{eq},i}}\right) \tag{1}$$


Mineral mass balance. The change in mass of basalt mineral $i$, $B_i$,
is defined by the rate of stoichiometric mass transfer of the elements
in mineral $i$ to solution. Equation (2) is required because we are
considering a finite mass of weathering rock, which over time can react to
completion, as opposed to in situ weathering of the lithosphere, for
example, when considering weathering and geomorphology.

$$\frac{\partial B_i}{\partial t} = -R_i \left(1 - \frac{C_i}{C_{\mathrm{eq},i}}\right) \tag{2}$$
Removal of weathering products. The total mass balance over time
for basalt mineral weathering allows calculation of the products
transported from the soil profile. The total mass of weathering basalt is
defined as follows, where $m$ is the total number of weathering minerals
in the rock, $t_f$ is the duration of weathering (years) and $L$ is the
total depth of the soil profile (metres).

$$\text{Total mass of weathered basalt} = \sum_{i=1}^{m} \left[ \phi \int_{z=0}^{L} C_i(t, z)\, \mathrm{d}z + q \int_{t=0}^{t_f} C_i(t, L)\, \mathrm{d}t \right] \tag{3}$$


We define $q$ as the net annual sum of water gained through precipitation
and irrigation, minus crop evapotranspiration, as calculated
with high-spatial-resolution gridded datasets of the contemporary
climate in this initial analysis, given the uncertainties in infiltration
and irrigation patterns at equivalent high spatial resolution for 2050
(Extended Data Figs. 8, 9; Supplementary Table 14).
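
To make the structure of the transport equation, the mineral mass balance and the total mass balance concrete, the following Python sketch integrates them for a single mineral with an explicit upwind finite-difference scheme. It is a toy illustration under stated assumptions, not the model used in this study: the first-order rate $R = kB$, all parameter values, and the function name `simulate` are hypothetical, and pH, temperature, plant effects and carbonate precipitation are ignored.

```python
def simulate(phi=0.3, q=0.5, depth=0.3, n_cells=30, c_eq=1e-3,
             b0=1e-2, k=1.0, dt=1e-3, years=1.0):
    """Explicit upwind integration of a one-mineral reactive transport column.

    phi: volumetric water content; q: water flux (m/yr); depth: L (m);
    c_eq: equilibrium concentration (mol per litre of water);
    b0: initial mineral content (mol per litre of bulk soil);
    k: assumed first-order rate constant (1/yr), so that R = k * B.
    """
    dz = depth / n_cells
    # Stability (CFL) requirement of the explicit upwind scheme.
    assert q * dt / (phi * dz) <= 1.0, "time step too large for advection"
    c = [0.0] * n_cells   # dissolved concentration C(z)
    b = [b0] * n_cells    # remaining mineral B(z)
    exported = 0.0        # q * integral of C(t, L) dt, leaving the profile
    for _ in range(round(years / dt)):
        exported += q * c[-1] * dt
        new_c = []
        for j in range(n_cells):
            upstream = c[j - 1] if j > 0 else 0.0    # solute-free water at surface
            source = k * b[j] * (1.0 - c[j] / c_eq)  # R * (1 - C / C_eq)
            advection = -q * (c[j] - upstream) / dz
            new_c.append(c[j] + dt * (advection + source) / phi)
            b[j] -= dt * source                       # mineral mass balance
        c = new_c
    stored = phi * sum(c) * dz                    # phi * integral of C dz
    dissolved = sum(b0 - bj for bj in b) * dz     # mineral lost from the solid
    return stored, exported, dissolved, c, b


stored, exported, dissolved, c_profile, b_profile = simulate()
# The solute stored in pore water plus the solute exported at depth L
# should equal the mineral mass dissolved, mirroring the total mass balance.
```

The chemical inhibition term drives the source towards zero as pore water approaches equilibrium, so in this sketch the water flux $q$ controls how quickly weathering products are flushed and dissolution can continue.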


Rate law. We modelled application of a crushed fast- or slow-weathering
basalt, with specified mineral weight fractions and physico-chemical
characteristics (Supplementary Tables 1-3). Rates of basalt grain
weathering define the source term for weathering products and are
calculated as a function of soil pH, soil temperature, soil hydrology and
crop net primary productivity (NPP) using the linear transition state
theory rate law. Plant-enhanced basalt weathering is modelled
empirically for annual and woody crops with power functions fitted to
data (Supplementary Fig. 4; Supplementary Table 4). These functions
represent the effects of a range of rhizosphere processes that accelerate
the physical breakdown and chemical dissolution of minerals, including
the activities of nutrient-scavenging mycorrhizal fungi that physically
disrupt and chemically etch mineral surfaces, and bio-production of
low-molecular-weight organic compounds and chelating agents.

Soil pH of each grid cell is dynamically calculated from the alkalinity
mass and flux balance for an adaptive time-step, controlled by mineral
dissolution rates, following initialization with a topsoil (0-15 cm)
pH value based on field data from global soil databases (Supplementary
Table 14); soil pH buffering capacity is accounted for with an empirical
buffer function relating soil pH to alkalinity. The soil pCO₂ depth
profile of a grid cell is generated with the standard gas diffusion
equation, scaled by crop NPP × 1.5 to account for combined autotrophic
and heterotrophic respiration. The alkalinity balance considers net
acidity input during crop growth for biomass cations removed from
the field, and secondary mineral precipitation of calcite.


Model advances 

We incorporate three further relevant advances into the above 
one-dimensional vertical transport model with steady-state flow. 
First, we provide a numerical basis for calculating weathering rates 
using log-normal particle size distributions of basalt grains produced 
by mechanical crushing and grinding for soil amendment. This
conceptualization improves on the simplified case of a single mean
particle diameter, previously used in ERW calculations. Second,
we apply the fractal dimension for surface roughness to relate reacting
surface area to basalt mass across physical scales of weathering from
the laboratory (in which the weathering kinetic parameter values are
empirically determined) to the field (where model results reflect CDR
operations). The fractal dimension effectively provides a means of
consolidating measurements taken at different scales, and accounts
for uncertainties in grain topography and porosity that affect
mass transfer rates from rock grains to flowing soil water. Finally,
we calculate mean rates of rock dust weathering and CDR following 
annual applications by tracking cohorts of particles applied over 
a 10-year time horizon and their mineral composition (Extended 
Data Fig. 7). 
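
The cohort bookkeeping described above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that every annual application loses mass by first-order decay with a single hypothetical rate constant `k_per_year`; the study itself uses mineral-specific kinetics and log-normal particle size distributions, which this sketch does not reproduce. The function name and the example rate constant are likewise assumptions.

```python
import math

def weathered_per_year(application_t_ha, k_per_year, horizon_years=10):
    """Mass weathered in each year when a fresh cohort is applied annually.

    Assumption (not from the source): each cohort decays as
    m(age) = m0 * exp(-k * age), with one rate constant for all minerals.
    """
    yearly = []
    for year in range(1, horizon_years + 1):
        total = 0.0
        # Cohorts applied at the start of years 1..year are all still present.
        for applied in range(1, year + 1):
            age_start = year - applied   # cohort age at the start of this year
            age_end = age_start + 1      # cohort age at the end of this year
            remaining_start = application_t_ha * math.exp(-k_per_year * age_start)
            remaining_end = application_t_ha * math.exp(-k_per_year * age_end)
            total += remaining_start - remaining_end
        yearly.append(total)
    return yearly

# 40 t/ha/yr is the baseline rate from the text; k = 0.2 per year is hypothetical.
rates = weathered_per_year(application_t_ha=40.0, k_per_year=0.2)
mean_rate = sum(rates) / len(rates)
```

Because older cohorts keep dissolving, the yearly total rises towards the application rate, which is why a mean over the 10-year horizon is more representative than the first-year value alone.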


Baseline simulations 

Using this modelling framework, we analysed a baseline application 
rate of 40 tonnes per hectare per year (equivalent to a <2 mm layer of 
rock powder distributed over the croplands), which falls within the 
range of basalt amendments shown to improve crop production in field
trials. Net CDR is defined as the difference between CO₂ capture by
ERW as dissolved inorganic carbon and soil (pedogenic) carbonate and
the sum of CO₂ emissions for logistical operations. Carbon emissions
per unit mass of ground rock depend on particle size (Extended Data
Fig. 10), the CO₂ emissions per kilowatt hour of electricity generated
from component energy sources (fossil fuels, nuclear and renewables),
as well as the carbon costs of sourcing and transporting the silicate
materials. Rock grinding to reduce particle size and maximize CDR is
the primary energy-consuming operation in ERW.
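
The net CDR definition above amounts to a simple balance, sketched here in Python. The function name, the dictionary keys and every number in the example are hypothetical placeholders chosen for illustration, not values from the study.

```python
def net_cdr(capture_dic_t, capture_carbonate_t, emissions_t):
    """Net CDR (t CO2) = gross capture minus summed logistics emissions.

    capture_dic_t       -- CO2 captured as dissolved inorganic carbon
    capture_carbonate_t -- CO2 captured as soil (pedogenic) carbonate
    emissions_t         -- mapping of operation name -> CO2 emitted
    """
    gross = capture_dic_t + capture_carbonate_t
    return gross - sum(emissions_t.values())

# Hypothetical per-hectare figures, for illustration only.
example = net_cdr(
    capture_dic_t=3.0,
    capture_carbonate_t=0.5,
    emissions_t={"mining": 0.1, "grinding": 0.4, "transport": 0.2, "spreading": 0.05},
)
```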

Assessment of basalt transport from source regions to croplands is
based on road and rail network analyses to calculate distances, costs
and carbon emissions for each scenario (Supplementary Methods
section 2.3). Our approach improves on prior analyses, which assumed a
fixed radius between rock dust source and site of application. We go
beyond global cost estimates by using national fuel (diesel), labour
and infrastructure costs to undertake logistical operations, and the
price of energy inputs to grind rocks. Our analysis thus represents
the first techno-economic assessment in which detailed ERW carbon
and economic costs vary within and between nations and account for
socio-technical uncertainties in energy production.


CDR 

We calculate CDR by ERW of crushed basalt applied to soils via two
pathways: (1) the transfer of weathered base cations (Ca²⁺, Mg²⁺, Na⁺ and
K⁺) from soil drainage waters to surface waters that are charge-balanced
by the formation of HCO₃⁻ ions and transported to the ocean
(equation (4)), and (2) formation of pedogenic carbonates (equation (5)).
Pathway 1 for calcium ions:

$$\mathrm{CaSiO_3 + 2CO_2 + 3H_2O \rightarrow Ca^{2+} + 2HCO_3^- + H_4SiO_4} \quad (4)$$

Pathway 2 for calcium carbonate formation:

$$\mathrm{Ca^{2+} + 2HCO_3^- \rightarrow CaCO_3 + CO_2 + H_2O} \quad (5)$$

Monovalent and divalent base cations are released from basaltic minerals
by dissolution based on stoichiometry (Supplementary Table 2).
CDR, via pathway 1, potentially sequesters two moles of CO₂ from the
atmosphere per mole of divalent cation. However, ocean carbonate
chemistry reduces the efficiency of CDR (η) to an extent depending on
ocean temperature, salinity and the surface ocean dissolved CO₂
concentration. We calculate η for average ocean temperature (17 °C),
salinity (35‰) and a Representative Concentration Pathway (RCP) 8.5
simulation for 2050 (the worst-case scenario) for dissolved pCO₂ of 600
µatm, giving η = 0.86, that is, 0.86 moles of CO₂ removed per mole of
monovalent cation and 1.72 moles of CO₂ removed per mole of divalent
cation added to the oceans. For pathway 1, the efficiency of
CDR = η × Σ(moles of monovalent cations) + 2η × Σ(moles of divalent cations).

CDR via pathway 2 can occur if dissolved inorganic carbon derived
from atmospheric CO₂ precipitates as pedogenic carbonate. This pathway
sequesters 1 mole of CO₂ per mole of Ca²⁺, instead of the 1.72 moles of
CO₂ sequestered via pathway 1, and is therefore less efficient. Thus, for
any given grid cell, we compute CDR by ERW as the alkalinity flux in soil
drainage and pedogenic calcite precipitation. Possible CO₂ degassing due
to changes in surface water chemistry during transport in large river
systems is not considered.
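
The stoichiometry of the two pathways can be written out as a short Python sketch. The efficiency factor of 0.86 and the yields of 1.72 and 1 mole of CO₂ per mole of divalent cation come from the text; the function names and the `frac_ca_to_calcite` split are hypothetical and stand in for the model's calcite precipitation calculation, which is not reproduced here.

```python
ETA = 0.86  # ocean carbonate-chemistry efficiency factor quoted in the text

def cdr_pathway1(mol_monovalent, mol_divalent, eta=ETA):
    """Moles of CO2 removed when cations drain to the ocean (pathway 1)."""
    return eta * mol_monovalent + 2.0 * eta * mol_divalent

def cdr_pathway2(mol_ca_precipitated):
    """Moles of CO2 held as pedogenic carbonate: 1 mol CO2 per mol Ca2+."""
    return 1.0 * mol_ca_precipitated

def total_cdr(mol_na, mol_k, mol_ca, mol_mg, frac_ca_to_calcite):
    """Split Ca between calcite precipitation and export to the ocean.

    frac_ca_to_calcite is a hypothetical input in [0, 1].
    """
    ca_calcite = mol_ca * frac_ca_to_calcite
    ca_exported = mol_ca - ca_calcite
    return (cdr_pathway1(mol_na + mol_k, ca_exported + mol_mg)
            + cdr_pathway2(ca_calcite))
```

Setting `frac_ca_to_calcite` to 1 shows why pathway 2 is less efficient: one mole of Ca²⁺ then removes 1 mole of CO₂ rather than 1.72.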


Cost assessment modelling 

An overview of the environmental costs model and its linkages with the
performance model is presented in Extended Data Fig. 4. We include
contributions to total cost of (1) mining, (2) processing, (3) distribution
and transport and (4) spreading on agricultural land. We considered
how the cost of energy and the carbon emissions varied with
grinding to different particle size distributions (Extended Data Fig. 10).
Grinding to finer particles requires greater energy and results in higher
carbon emissions. We defined the particle size distribution by the
p80 value, that is, the particle diameter at or below which 80% of the
particles fall. We calculated the optimized p80 that results in maximum
net CDR for each grid cell, and this was conducted for different
fractions of a country’s crop area (0.1 to 1.0 at 0.1 increments),
ordered according to weathering potential. For
a given p80 value, we calculate the weathering rate for each grid cell,
sort them in descending order and find the grid cells that comprise
the cumulative fraction of total land for each incremental increase
in land area.
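
The p80 definition and the ranking of grid cells by weathering potential can be sketched in Python as follows. This is an illustrative reading of the procedure, not the authors' Matlab code: the function names are hypothetical, p80 is computed by particle count as the text words it, and each grid cell is represented by a made-up `(weathering_rate, area)` pair.

```python
def p80(diameters_um):
    """Smallest observed diameter with at least 80% of particles at or below it."""
    ordered = sorted(diameters_um)
    # Integer ceiling of 0.8 * n, minus one to convert to a zero-based index.
    idx = max(0, -(-8 * len(ordered) // 10) - 1)
    return ordered[idx]

def select_cells(cells, land_fraction):
    """Pick the highest-rate cells until their area reaches land_fraction.

    cells -- list of (weathering_rate, area) tuples (hypothetical layout)
    """
    total_area = sum(area for _, area in cells)
    target = land_fraction * total_area
    chosen, cumulative = [], 0.0
    for rate, area in sorted(cells, key=lambda c: c[0], reverse=True):
        if cumulative >= target - 1e-12:
            break
        chosen.append((rate, area))
        cumulative += area
    return chosen

# Land fractions from 0.1 to 1.0 in 0.1 increments, as in the text.
fractions = [i / 10 for i in range(1, 11)]
```

Repeating `select_cells` for each entry in `fractions`, and for each candidate p80, gives the net CDR surface over which the optimum p80 would be searched.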

Optimization is conducted by country for each of the two types of 
basalt and their log-normal particle size distributions (Supplementary 
Tables 1-3). Country-specific electricity production and the forecast 
fractional contributions to electricity production by different energy 
sources (coal, natural gas, oil, solar photovoltaics, concentrated solar 
power, hydropower, wind, marine) for 2050 are based on BAU, that is,
currently implemented energy policies, and energy projections
consistent with a 2 °C warming scenario (Extended Data Fig. 9). National
CO₂ emissions for electricity generation consistent with both scenarios
were based on results reported in ref. ° (Supplementary Tables 6-9). 
Industrialized nations (for example, Canada) consume up to about 
2% of their total energy production on rock comminution (crushing 
and grinding) processes”. We assume a future maximum upper limit 
of 3% energy consumption for all nations, based on the rationale that 
current rates for developed nations grow from around 2% today in 
line with national projected energy production® in 2050 (Extended 
Data Fig. 2). 

Distribution costs and emissions were calculated by performing
spatial analysis with ArcGIS (http://www.arcgis.com) software. Basalt
rock sources were identified from the GLiM rock database, excluding
those in protected areas. We then performed a global transport (rail
and road) network analysis, modelling ERW supply logistics by creating
an origin-destination cost matrix using GIS. For larger datasets,
the origin-destination cost matrix searches and measures the least-cost
paths along the network from multiple origins to multiple destinations
to identify the most cost-effective or shortest route between a source
and destination. Transport analyses used the lowest-emission option
between rail and road network to calculate distribution costs and CO₂
emissions (Supplementary Tables 10-12). Freight-rail emissions were
obtained from 2050 projections of reduced carbon emissions following
improvements in energy efficiency. Rail CO₂ emissions were the
same for both the BAU and 2 °C scenarios. For road transport, we
considered estimated energy consumption of currently or soon-to-be
available heavy electric trucks (1.38 kWh km⁻¹) and projected carbon
emissions in the electricity sector of each country for BAU or the 2 °C
scenario.
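
The choice of the lowest-emission mode for a given route can be sketched in Python. Only the truck energy use of 1.38 kWh per km comes from the text; the truck payload, the rail emission factor, the grid carbon intensities and all function names are hypothetical inputs introduced for illustration, and route distances are assumed to come from a separate network analysis.

```python
import math

TRUCK_KWH_PER_KM = 1.38  # heavy electric truck energy use quoted in the text

def road_emissions_kg(distance_km, tonnes, grid_kg_co2_per_kwh, payload_t):
    """CO2 from electric trucks; payload_t is a hypothetical truck capacity."""
    trips = math.ceil(tonnes / payload_t)
    return trips * distance_km * TRUCK_KWH_PER_KM * grid_kg_co2_per_kwh

def rail_emissions_kg(distance_km, tonnes, rail_kg_co2_per_tkm):
    """CO2 from freight rail, using a hypothetical per tonne-km factor."""
    return distance_km * tonnes * rail_kg_co2_per_tkm

def lowest_emission_mode(road_km, rail_km, tonnes, grid_kg_co2_per_kwh,
                         rail_kg_co2_per_tkm, payload_t=20.0):
    """Return (mode, kg CO2) for whichever of road or rail emits less."""
    road = road_emissions_kg(road_km, tonnes, grid_kg_co2_per_kwh, payload_t)
    rail = rail_emissions_kg(rail_km, tonnes, rail_kg_co2_per_tkm)
    return ("road", road) if road <= rail else ("rail", rail)
```

Because road emissions scale with the national grid intensity, the preferred mode for the same route can differ between the BAU and 2 °C scenarios.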


Forecasting bulk silicate waste production 

We developed a model that relates global per capita material production
(for cement) or consumption (steel) P to per capita gross world
product (GWP) through historical global data using nonlinear least
squares (equation (6)):

$$P = a\,e^{-b/\mathrm{GWP}} \quad (6)$$

where a and b are regression constants. The derived saturation value,
a, was used in a further regression through national data normalized
to 2014 production and GDP (equation (7)):


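$$P = P_{\mathrm{REF}} \times [1 + (m + r) \times \Delta\mathrm{GDP}] \times e^{a \times (1 - e^{-(m \times \Delta\mathrm{GDP})}) - (m \times \Delta\mathrm{GDP})} \quad (7)$$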


where P_REF is the global per capita consumption in a given reference year
(2014), ΔGDP is the deviation of the per capita gross domestic product
from the reference year, and m and r are regression constants. These
results were used together with averaged projections of future GDP
(Supplementary Table 14) from the ‘middle-of-the-road’ Shared
Socioeconomic Pathway (SSP2) to derive nationally resolved projections of
future per capita consumption/production. SSP2 potentially represents
the largest material production pathway, as other SSPs forecast
lower consumption or economic growth, producing 30% to 50% less
material globally. We have not considered the penetration of recycling
into steel production beyond its current rate. Cement and cement kiln
dust have no capacity to be recycled as cement. The total production/
consumption at a given time, T(t), was calculated by multiplying the
population, Pop(t), by production or consumption (P). We assume
that 115 kg of cement kiln dust is produced as a by-product in kilns for
every tonne of clinker, and have modelled the production of demolition
waste following an average 50-year service life (normally distributed
with a standard deviation of 10 years). The ratio of pig iron to steel
production (0.72) was obtained using linear regression of 1960-2014
data, negating the need to explicitly model pig iron displacement from
scrap recycling, and assuming that the scrap ratio remains unchanged.
All steel and blast furnace slag was considered available for reaction
with CO₂. Between 2006 and 2014, 185 kg of blast furnace slag and
117 kg of steel slag was produced for every tonne of crude steel.
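
The by-product ratios and the service-life assumption above translate into simple arithmetic, sketched here in Python. The ratios (115 kg, 185 kg and 117 kg per tonne) and the 50-year mean and 10-year standard deviation come from the text; the function names are hypothetical, and treating the demolished share of a cohort as the cumulative normal distribution of its age is an assumption about how the service life could be applied.

```python
from statistics import NormalDist

CKD_KG_PER_T_CLINKER = 115.0        # cement kiln dust per tonne of clinker
BF_SLAG_KG_PER_T_STEEL = 185.0      # blast furnace slag per tonne of crude steel
STEEL_SLAG_KG_PER_T_STEEL = 117.0   # steel slag per tonne of crude steel
SERVICE_LIFE = NormalDist(mu=50.0, sigma=10.0)  # years

def total_production(population, per_capita_t):
    """T(t) = Pop(t) x P, in tonnes."""
    return population * per_capita_t

def byproducts_t(clinker_t, crude_steel_t):
    """Tonnes of each silicate by-product for given clinker and steel output."""
    return {
        "cement_kiln_dust": clinker_t * CKD_KG_PER_T_CLINKER / 1000.0,
        "blast_furnace_slag": crude_steel_t * BF_SLAG_KG_PER_T_STEEL / 1000.0,
        "steel_slag": crude_steel_t * STEEL_SLAG_KG_PER_T_STEEL / 1000.0,
    }

def demolished_fraction(age_years):
    """Share of a construction cohort demolished by a given age (assumed CDF)."""
    return SERVICE_LIFE.cdf(age_years)
```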


Data availability 


Datasets on global crop production and yield are available at
http://www.earthstat.org/, accessed on 18 December 2019. Datasets
on global crop irrigation are available at
https://zenodo.org/record/1209296, accessed on 18 December 2019.
Datasets on global precipitation are available at
http://www.climatologylab.org/terraclimate.html, accessed on
18 December 2019. Datasets on global soil surface pH are available at
https://daac.ornl.gov/SOILS/guides/HWSD.html, accessed on
18 December 2019. Datasets on global soil temperature are available at
https://esgf-node.llnl.gov/search/cmip5/, accessed on 18 December 2019.
Datasets on diesel prices are available at
https://data.worldbank.org/indicator/EP.PMP.DESL.CD. Datasets on mining
costs are available at http://www.infomine.com/. Datasets on gross
national income per capita are available at
https://data.worldbank.org/indicator/ny.gnp.pcap.pp.cd. Datasets for
projections of future GDP linked to Shared Socioeconomic Pathways are
available at https://tntcat.iiasa.ac.at/SspDb. Source data are provided
with this paper.


Code availability 


The Matlab codes developed for this study belong to the Leverhulme 
Centre for Climate Change Mitigation. The authors will make them 
available upon reasonable request.

Extended Data Fig. 1 | Performance model schematic. Detailed methods are provided in Methods sections ‘CDR simulation framework’ and ‘Model advances’.
Spatially resolved key drivers are mapped in Extended Data Fig. 8; sources given in Supplementary Table 14.

Extended Data Fig. 5 | Cumulative silicate demand by nation. Results are
shown for the seven nations of the world (a-g) and the five European nations
(h-l) with the highest CDR, as ranked by net CDR capacity, with increasing

undergoing dissolution at different rates, with some minerals continuing to
undergo dissolution and capture CO₂ after the first year of application. Such
simulations allow average rates of weathering and CDR from repeated basaltic
rock dust applications to be computed. Our extended theory underpinning the
simulation framework tracks cohorts of particles applied each year and their
mineral composition over time to account for cumulative effects
(Supplementary Methods).


The phylum of annelids is one of the most disparate animal phyla and encompasses
ambush predators, suspension feeders and terrestrial earthworms. The early
evolution of annelids remains obscure or controversial, partly owing to discordance
between molecular phylogenies and fossils. Annelid fossils from the Cambrian
period have morphologies that indicate epibenthic lifestyles, whereas phylogenomics
recovers sessile, infaunal and tubicolous taxa as an early diverging grade.
Magelonidae and Oweniidae (Palaeoannelida) are the sister group of all other
annelids but contrast with Cambrian taxa in both lifestyle and gross morphology.
Here we describe a new fossil polychaete (bristle worm) from the early Cambrian
Canglangpu formation that we name Dannychaeta tucolus, which is preserved within
delicate, dwelling tubes that were originally organic. The head has a well-defined
spade-shaped prostomium with elongated ventrolateral palps. The body has a wide,
stout thorax and elongated abdomen with biramous parapodia with parapodial
lamellae. This character combination is shared with extant Magelonidae, and
phylogenetic analyses recover Dannychaeta within Palaeoannelida. To our
knowledge, Dannychaeta is the oldest polychaete that unambiguously belongs to
crown annelids, providing a constraint on the tempo of annelid evolution and
revealing unrecognized ecological and morphological diversity in ancient annelids.


Annelida Lamarck, 1809 
Palaeoannelida Weigert & Bleidorn, 2016 
Magelonidae Cunningham & Ramage, 1888 
Dannychaeta tucolus gen. et sp. nov. 


Etymology. Danny refers to Danny Eibye-Jacobsen, for his contribu- 
tions to our understanding of early annelids; chaeta (Latin), bristle; 
tubus (Latin), tube; colus (Latin), dwelling in. 

Holotype. YKLP 11382 part and counterpart (YKLP, Yunnan Key Laboratory
for Palaeobiology) (Fig. 1 and Extended Data Fig. 1).

Referred material. YKLP 11383-11402 (Figs. 2, 3 and Extended Data 
Figs. 2-6). 

Horizon and locality. Canglangpu Formation, Cambrian stage 3,
Hongjingshao Member (around 514 million years ago), southwest
of Guanshan reservoir, Chenggong, Kunming, China.

Diagnosis. Elongated, slender polychaetes with organic dwelling tubes. 
Head with anteriorly tapering spade-shaped prostomium, with paired 
palps attached ventrolaterally near the mouth. Body heteronomously 
segmented, with a wider thorax containing at least eight chaetigers. 
Parapodia with lateral lamellae in the posterior part of the abdomen. 
Abdominal parapodia biramous, unknown in thorax. Capillary chaetae 
in both rami, occurring in tight parallel bundles. 


Description. The holotype (Fig. 1 and Extended Data Fig. 1) is incomplete
posteriorly (around 40 mm long), and has a wider anterior region (thorax;
maximum width, 3.9 mm) (Fig. 1a-e) and an abdomen (about 1.9 mm wide,
excluding parapodia) (Fig. 1h, i). Extrapolating from segment
spacing (15.5 mm length and 1 segment per 1.9 mm), the thorax
would consist of at least 8 chaetigers. The prostomium is
a spade-shaped lobe (Figs. 1c-e, 2 and Extended Data Figs. 1a-g, 2a-h)
and is longer (approximately 4 mm) than it is wide (around 2 mm).
The relief of overlapping anatomical features preserved on different
planes indicates that the prostomium is dorsal of the palps (Figs. 1d-f,
2b-d). The palps cross over each other in the holotype (Fig. 1d-f), are
incompletely preserved anteriorly, but are at least 30% the length of
the thorax. A specimen in ventral view shows palps that insert
ventrolaterally, anterior of a putative burrowing organ (Fig. 2a-d). The
prostomium has a faint pair of tapering ridges (Fig. 2d). The gut is
preserved as a carbonaceous film (Fig. 3f) that terminates adjacent to palp
bases (Fig. 1d-f), indicating palp attachment near the mouth opening.

Abdominal parapodia are distinct lobes, projecting around 300 µm
from the body (Fig. 3c-e). Anterior abdominal chaetae in the holotype
are around 500 µm long (Fig. 1h). In narrower midbody fragments,
chaetae are approximately 800 µm long (Fig. 3d), suggesting that the
chaetation is variable along the body. Abdominal chaetae are in tight
fascicles (Figs. 1j, 3c-e and Extended Data Fig. 3k) and are the clearest
using fluorescence microscopy (Fig. 3d). The chaetae are directed
slightly anteriorly (Fig. 1i); this feature is used to orient fragmentary
specimens. In two specimens, lateral lamellae occur adjacent to chaetal
bundles (Fig. 3g). Lamellae are crescent-shaped and approximately half
the body width in length, with a dorsolateral (Extended Data Fig. 1k) to
dorsal (Extended Data Fig. 5f-h) placement. As the rami are often parallel,
a biramous morphology is revealed by subparallel chaetal bundles
(Fig. 3d) or rare oblique views (Fig. 3h and Extended Data Fig. 5). Fine
details of the chaetal morphology are obscure but are consistent with
capillaries in the abdomen (Fig. 3d).

Fig. 1 | Holotype specimen YKLP 11382 of D. tucolus. a, b, Part (a) and
counterpart (b), imaged using direct light. c, Interpretative drawing of the
counterpart. The colour scheme is described in f. d, Anterior region of the part.
e, Anterior region of the counterpart under fluorescent light. The specimen is
mirrored to facilitate the comparison with the part. f, Interpretative drawing of
the anterior, based on both part and counterpart. g, Fe map from SEM-
energy-dispersive X-ray spectroscopy (SEM-EDX) analysis. h, i, Abdominal
region, counterpart, obtained using direct light (h) and fluorescent light (i).
j, Magnification of the parapodium and chaetae. ch, chaetae; gu, gut; mo,
mouth; pa, parapodium; pa1, parapodium of segment one; pa8/9, parapodium
of segment eight or nine; pl, palp. Scale bars, 200 µm (j), 1 mm (h, i) and 2 mm
(a, b, d, e; the scale bar in e also applies to g).

The pygidium is never well-preserved, but one specimen has putative 
pygidial cirri (Extended Data Fig. 5j, k). An ovoid structure between 
chaetiger one and three is of uncertain identity but resembles a blood 
lacuna (Fig. 1d-f and Extended Data Figs. 1h-j, 4g). 

Eight specimens are preserved within a structure that extends 
beyond the body margin and chaetae (Fig. 3a-f, i-k and Extended 
Data Figs. 2-4). This structure is parallel to the body axis, approximately
four times the body width (excluding parapodia) (Fig. 3a) with
a sharp boundary with the matrix, which is visible using light (Fig. 2f)
and fluorescence (Fig. 3h) microscopy and elemental maps for iron
(Extended Data Fig. 3f). The structure is consistent with a dwelling 
tube, or tube-lined burrow. The tube has a slightly darker appearance 
relative to the matrix and lacks identifiable agglutinated bioclasts or 
grains. Tubes have a slight relief (Extended Data Fig. 3g) and sometimes 
have thick walls at their margins (Extended Data Fig. 6c—e), owing to 
compaction. As with the body, the tube contains iron (Fig. 3f) localized 
to small grains in the matrix that appear bright in backscatter images 
obtained using scanning electron microscopy (SEM) (Extended Data 
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Fig. 2| Anterior region of D. tucolus.a, YKLP11390b, anterior fragment with 
head and palps. b, Anterior region of YKLP 11390b, low-angle illumination 

from the northwest. c, Region as inb, obtained using fluorescent light. 

d, Interpretative drawing outlining the anatomical features (colour scheme as 
in Fig. 1f).e, YKLP11393b specimen, preserving the anterior region, dwelling 
tubes anda partial abdomen. f, YKLP 11393a, part. g, YKLP 11393b, details 

of the anterior region. The white arrowheads indicate the tube margins. 

h, Interpretative drawing of g. bo?, possible burrowing organ; pro, prostomium; 
rd, prostomial ridge. Scale bars, 500 pm (b), 2mm (a, g) and3 mm(e, f). 


Fig. 3f). This is consistent with an organic composition, with organic 
material acting as a substrate for pyrite formation®. Specimens pre- 
served in tubes vary from well-preserved with delineated parapodia 
and/or chaetae (Fig. 3c—e and Extended Data Fig. 3) to highly effaced, 
indicating in situ decay’ (Extended Data Figs. 4, 6). Preservation quality 
varies along individual specimens (Extended Data Fig. 3a—d). 


Discussion 


Owing to low preservation potential, annelid body fossils are rare
and distributed discontinuously and unevenly through geological
time. Diverse fossil polychaetes are known from early and middle
Cambrian deposits (for example, Sirius Passet and Burgess Shale),
but are rare from China. Cambrian annelids typically are generalized
polychaetes in morphology, with well-developed biramous parapodia
(suggesting motility), elongated chaetae and a pair of palps. They
cannot be assigned to extant annelid subclades and lack proposed
annelid synapomorphies (for example, the prostomium-peristomium
head structure and typically pygidial cirri) and are interpreted as
stem-group annelids. The lack of Cambrian crown annelids has
prompted hypotheses of relatively late crown-group diversification,
perhaps during the late Cambrian to Ordovician period, when jawed
polychaetes become diverse and abundant.


Both molecular and morphological phylogenies have converged
on scenarios in which annelids evolved from polychaete-like ancestors.
However, reconciling fossil and phylogenomic evidence has been
challenging. Molecular phylogenies recover a grade of infaunal, sessile
and tube-dwelling taxa as deep branches, including Magelonidae,
Oweniidae, Chaetopteriformia and Sipuncula. These groups differ 
from Cambrian polychaetes in terms of gross morphology and inferred 
mode of life®. A tube-dwelling annelid ancestor has previously been 
proposed” (although see ref. 7°), which is contradicted by interpreta- 
tions of the fossil record’, and morphological hypotheses regarding 
the origin of annelid body plan features, for example, segmentation 
and parapodia’®”°. Fossil specimens interpreted as sipunculans (which 
phylogenomic studies recover within Annelida’”) are known from the 
early Cambrian Chengjiang Lagerstatte” but are rare and poorly known. 
If correctly interpreted, the oldest fossil crown annelids therefore 
belong to taxa that have lost most annelid synapomorphies*, includ- 
ing segmentation”. Magelonidae and Oweniidae (Palaeoannelida’) are 
recovered as the sister group of all other annelids and so have featured 
prominently in recent discussions of the annelid ancestor>”’. These 
families are unusual among polychaetes, as they lack nuchal organs, 
possess monociliated epidermal cells and simple nervous systems’. 

Dannychaeta is dissimilar in gross morphology to that of previously 
known Cambrian polychaetes, but shares derived characters with extant 
Magelonidae. A spade-shaped prostomium with ventrolateral palps (Fig. 2) 
is characteristic of Magelonidae, which also have a differentiated thorax 
composed of eight or nine chaetigers. Other well-known Cambrian
annelids lack a clearly demarcated head and prostomial lobe (but have
lateral palps), suggesting that the typical annelid head structure evolved
after the origin of a segmented body, parapodia and palps. The presence
of a differentiated head in Dannychaeta is unique among Cambrian annelid
fossils, indicating in itself that Dannychaeta has a phylogenetic position
proximal to or within the annelid crown group. The importance of the
putative blood lacuna is uncertain, but small ring-shaped vessels occur
in Magelona and a larger lacuna occurs in Poecilochaetus. However, in
both species, these are placed more anteriorly, nearer the prostomium.


Fig. 3 | Morphological details of D. tucolus.
a-f, YKLP 11383b. a, Abdominal region within a
dwelling tube. b, Interpretative drawing of a (colours
as in Fig. 1f). Anteriorly, black arrowheads indicate the
body margin and white arrowheads indicate the tube
edge. c, Magnification showing the segmented body
and parapodia. d, Magnification of the chaetal
fascicle. e, Interpretative drawing outlining the
chaetal fascicles. f, Magnification and SEM-EDX
maps. g, Magnification of YKLP 11389a showing the
parapodia and lateral lamellae. h, Magnification of
YKLP 11384a showing the biramous parapodia and
chaetae. The white arrowhead indicates the tube
edge. plm, parapodial lamella. Scale bars, 500 µm (d),
1 mm (c, h), 2 mm (f, g) and 4 mm (b).


Although not widespread (around 7% of species), several magelonid
species live in tube-lined burrows with an organic or parchment-like
composition, some adhered sediment grains and/or bioclasts. These
tubes are similar to those of Dannychaeta in inferred construction
materials and in dimensions relative to the body. The presence of both
obliquely and parallelly oriented specimens (Extended Data Figs. 2, 4)
suggests at least some transport before burial.

We reconstruct Dannychaeta as a sessile, infaunal polychaete that
fed in the water column using elongated palps (as in extant
Magelonidae) (Fig. 4a, b). Our phylogenetic analyses recover Dannychaeta
in the magelonid stem group (Fig. 4c and Extended Data Figs. 7, 8).
Parapodial and chaetal morphologies differ in some details in
Dannychaeta and Magelonidae. In extant Magelona, the chaetae of the
abdomen are hooded hooks, which are generally shorter than the
thoracic capillary chaetae and occur in rows, whereas abdominal
chaetae in Dannychaeta resemble capillaries held in bundles (Fig. 3c-e).
Hooded hooks in some families share details of ultrastructure and
formation, resulting in a proposed close relationship between
Capitellidae, Spionidae and Magelonidae, which is not supported by
phylogenomics. The absence of hooded hooks in Dannychaeta may
therefore provide consilient evidence of convergent chaetal evolution
in these families. Parapodial lamellae also differ in some details
in Dannychaeta, as they are dorsolaterally placed, but occur partially
in inter-ramal space in some species in Magelona. Dannychaeta is
larger than extant magelonids, which are typically less than 1 mm wide,
although tubicolous species (for example, Magelona alleni) achieve
the largest widths and share a more-robust anterior region (thorax)
with Dannychaeta.

Other non-pleistoannelid polychaetes also share some characters
with Dannychaeta. Spiochaetopterus has elongated palps, a differentiated
anterior region and builds organic tubes. However, chaetopterids
have ridge-like parapodia (tori) with short, hooked chaetae (uncini),
which are distinct from laterally projecting, lobate parapodia in
Dannychaeta. The head of Spiochaetopterus is neither a distinct anterior
lobe nor spade-shaped. Nevertheless, chaetopterids branch proximally
to Palaeoannelida. A close relationship between Dannychaeta and
chaetopterids would also suggest the presence of tubicolous, early
branching crown-group annelids in the early Cambrian.

Fig. 4 | Reconstruction of Dannychaeta. a, Technical drawing showing
proportions, body regions and gross anatomy. The colour scheme is as in
Fig. 1f. The dorsal view (left) and right lateral view (right) are shown. Parapodia
in the thorax that were not observed but inferred are shown in grey. b, Life
reconstruction showing D. tucolus living in buried tubes. Artwork was created
by R. Nicholls. c, Bayesian phylogenetic analysis (365 characters, 143 taxa, mki +
gamma model), incorporating D. tucolus. Numbers at the nodes are posterior
probabilities; the scale bar indicates the number of substitutions per site (see
Extended Data Figs. 7, 8 for full results and additional information). Fossil taxa
are indicated by an asterisk.


Annelid tube fossils are well-documented. Some late Ediacaran
and early Cambrian fossil tubes are tentatively assigned to annelids
or described as of ‘annelid grade’, but lack diagnostic features that
would enable confident phylogenetic placement. For example, the
controversial late Ediacaran tubular fossil Cloudina has been tentatively
reconstructed as an annelid based on the presence of a tubular gut.
Tube dwelling has evolved several times among extant annelids, and
‘annelid-mimicking’ taxa with lophophorate affinities are well-known
from the early Palaeozoic era, indicating that caution should be
exercised when assigning fossil tubes.

Regardless of the phylogenetic position and fossil record of
sipunculans, Dannychaeta confirms that crown annelids are minimally
early Cambrian in age, revealing early exploration of sessile ecological
niches, more than a hundred million years before other unambiguous
examples of tubicolous annelids. Dannychaeta reveals that
stem-group annelids coexisted with members of the crown group in
the early Cambrian and exhibited a diversity of life modes, including
epibenthic and sessile forms.

Methods


The specimens were studied, photographed at the Yunnan Key Labora- 
tory for Palaeobiology, Yunnan University (YKLP) and at the University 
of Exeter, and are deposited at the YKLP. Photographs were taken using 
a Canon EOS 5DSR coupled with a MACRO 100-mm lens, and a Leica 
DFC7000T linked to a Leica M205 FA fluorescence microscope. The 
excitation wavelength of GFPLis 480 nm, and the excitation wavelength 
of RFPLis 546 nm. Images were obtained with a gain value of 3.3, satura- 
tion value of 52.00 and gamma value of 0.92. The external light source 
of the fluorescence microscope was a LEICA KL 300 LED, used for tak- 
ing white-light images. SEM images were collected using a FEI Quanta 
650 FEG using an accelerating voltage of 25 kV and a working distance 
of 12.4 mm. EDX analyses were carried out with an EDAX Pegasus using 
accelerating voltages of 25-30 kV witha working distance of 12.4-13 mm. 
Phylogenetic analyses were based ona previously published character 
matrix for annelids and their close relatives®, which has been updated suc- 
cessively with the addition of new taxa and fossil data! °!, We performed 
Bayesian analyses using MrBayes v.3.2.6 using the mki model with the 
Lewis correction for the scoring of only informative characters”, with 
default priors for all parameters (that is, all trees were given equal prior 
probability). Bayesian analyses with and without topological constraints 
based on phylogenomic trees were used to investigate whether the con- 
flicting topologies recovered from morphological and molecular data 
affected the phylogenetic position of D. tucolus (Extended Data Fig. 7). 
These constraints were constructed by incorporating results from the 
most recent” and taxon-rich®* phylogenomic analyses of annelids and 
are outlined in detail in the Supplementary Information. We followed a 
previous study” by excluding Arkonips and Guanshanchaeta from the 
character matrix as they contain redundant character scores. For both 
analyses, 100 million generations were requested, with the analysis stop- 
ping once the average deviation of split frequencies dropped below 0.01. 
Convergence was then assessed using effective sample size (>200) and 
potential scale-reduction factor (about 1.0) values for all model 
parameters. Parsimony analyses without topological constraints were 
conducted in TNT 1.5 (courtesy of the Willi Hennig Society), using both equal 
weights and implied weights with k = 10. Bremer support and jack-knife 
and bootstrap frequencies from 1,000 replications were inferred for 
equal-weight trees and frequency differences were inferred from 1,000 
replicates of symmetric resampling for the implied-weight trees. 
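
The convergence criterion described above (a potential scale-reduction factor of about 1.0) can be illustrated with a minimal sketch. MrBayes reports this diagnostic itself; the Python below is only the textbook Gelman-Rubin formula applied to simulated chains. The function name `potential_scale_reduction` and the toy chains are assumptions made for illustration and are not part of the original analysis.

```python
import random
from statistics import fmean, variance


def potential_scale_reduction(chains):
    """Textbook Gelman-Rubin PSRF for one parameter sampled by several chains."""
    m = len(chains)
    n = len(chains[0])
    if m < 2 or n < 2 or any(len(c) != n for c in chains):
        raise ValueError("need at least two chains of equal length (>= 2 samples)")
    # W: mean of the within-chain sample variances.
    within = fmean([variance(c) for c in chains])
    if within == 0:
        raise ValueError("chains have zero within-chain variance")
    # B / n: sample variance of the per-chain means.
    between_over_n = variance([fmean(c) for c in chains])
    pooled = (n - 1) / n * within + between_over_n
    return (pooled / within) ** 0.5


# Hypothetical chains: four that sample the same distribution, and four
# in which two chains are stuck around a different value.
rng = random.Random(0)
mixed = [[rng.gauss(0.0, 1.0) for _ in range(2000)] for _ in range(4)]
stuck = [[rng.gauss(mu, 1.0) for _ in range(2000)] for mu in (0.0, 0.0, 3.0, 3.0)]

print("well mixed:", round(potential_scale_reduction(mixed), 3))
print("not converged:", round(potential_scale_reduction(stuck), 3))
```

Values close to 1.0 indicate that between-chain and within-chain variation agree; values well above 1.0 indicate that the chains have not yet sampled the same distribution.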


Reporting summary 
Further information on research design is available in the Nature 
Research Reporting Summary linked to this paper. 
Extended Data Fig. 1 | Additional details of the holotype specimen. a, Entire 
specimen YKLP 11382a, including a possible posterior region. b, Possible 
posterior region from the region shown in a. c, YKLP 11382a, boxes show 
regions used for details of the head and interpretative drawing and 
fluorescence microscopy. d, Interpretative drawing of the specimen shown in c. 
e, YKLP 11382b. Image shows the regions used in SEM-EDX mapping and 
fluorescence microscopy. f, Magnification of the anterior region, showing the 
prostomial lobe and palp attachment. g, Interpretative drawing of the region 
shown in f, showing major anatomical features. The colour scheme is as in 
Fig. 1f. The palp attachment was inferred with additional information from the 
counterpart. h, SEM-EDX elemental maps. Element names for each map are 
shown in the bottom right corner. i, Fluorescence image showing the anterior 
region of the body in YKLP 11382b, including the gut, palps and putative blood 
lacuna. j, Fluorescence image of the putative blood lacuna in YKLP 11382a. 
k, Magnification of the posterior fragment associated with the holotype with 
parapodial lamellae. bl, blood lacuna. 


Extended Data Fig. 2 | Specimen of D. tucolus YKLP 11393 preserved in a 
dwelling tube. a, Specimen YKLP 11393b, for which the anterior region, 
dwelling tube and a partial abdomen are preserved. b, Specimen YKLP 11393a, 
for which the anterior region, dwelling tubes and a partial abdomen are 
preserved. c, Specimen YKLP 11393b. Magnification of the anterior region 
(boxed area shown in a). The white arrowheads indicate the tube margins. 
d, Interpretative drawing of the region shown in c. e, Magnification of the 
prostomium and palps shown in c. f, Specimen YKLP 11393a. Magnification of 
the anterior region (boxed area shown in b). The white arrowheads indicate the 
tube margins. g, Magnification of the same region as in f, imaged using 
fluorescence microscopy. h, Interpretative drawing of the region shown in f 
and g. i, Poorly preserved abdominal region, from the region shown in a, 
imaged using direct light. j, Same region as in i, imaged using fluorescence 
microscopy. k, Poorly preserved abdominal region, from the region shown in b, 
imaged using direct light. l, Same region as in k, imaged using fluorescence 
microscopy. m, Thoracic chaetiger showing parapodia and chaetae from the 
region shown in b. n, Thoracic chaetiger showing parapodia and chaetae from 
the region shown in a. 

Extended Data Fig. 3 | Additional details of specimen YKLP 11383, 
preserved inside a dwelling tube parallel to the bedding. a, YKLP 11383a, 
midbody fragment preserved inside a dwelling tube, part. b, Interpretative 
drawing of the specimen shown in a; regions demarcated by black and blue 
brackets represent decayed and well-preserved regions of the body fossil, 
respectively. c, YKLP 11383b, midbody fragment preserved inside a dwelling 
tube, counterpart. d, Interpretative drawing of the specimen shown in c. 
e, Magnification of a well-preserved region of YKLP 11383a as shown in a, 
showing 11 chaetigers preserved inside the dwelling tube. f, SEM backscatter 
image of a similar region to EDX maps shown in Fig. 3f, showing bright grains 
that are associated with the tube and body fossil. The arrowheads indicate the 
left tube margin and pyritized tube wall. g, Section of the fossil shown in a 
photographed under low-angle light to indicate the relief of the dwelling tube. 
h, Section of the fossil shown in c photographed under low-angle light to 
indicate the relief of the dwelling tube. i, Magnification of three chaetiger 
regions of the region shown in e. j, Same region as in i, photographed using 
fluorescence microscopy. k, Magnification of the individual parapodium 
shown in j, photographed using fluorescence microscopy. 


Extended Data Fig. 4 | Specimens YKLP 11385a, YKLP 11387 and YKLP 11401, 
showing effaced specimens preserved in dwelling tubes. a, YKLP 11385a, 
anterior fragment comprising the thorax and abdomen preserved inside 
dwelling tube. b, YKLP 11387a, anterior fragment preserving the thorax and 
abdomen. c, YKLP 11387b, anterior fragment preserving the thorax and 
abdomen. d, Magnification of the abdominal chaetigers in YKLP 11387a, from 
the region shown in b. The white filled arrows indicate the tube margins. 
e, Same region as in d imaged using fluorescence microscopy. f, YKLP 11401, 
effaced specimen preserved in a dwelling tube, including the putative blood 
lacuna. g, Magnification of the region shown in f, showing the gut and possible 
blood lacuna. Brackets in a–c and f indicate the position of the thoracic region. 

Extended Data Fig. 5 | Additional details of specimen YKLP 11389, showing 
details of parapodia, parapodial lamellae and the posterior region. 
a, YKLP 11389b, counterpart, posterior fragment preserving parapodia and 
chaetae. b, YKLP 11389a, part showing preservation of lateral parapodial 
lamellae. c, Magnification of five chaetigers from the region shown in a. 
d, Magnification of chaetiger from the region shown in c. e, Magnification of 
chaetiger in d. f, Magnification of chaetiger, preserving the parapodial lamellae 
from the region shown in b. g, Chaetigers preserving the parapodial lamellae 
from the region shown in b. h, Same region as in g, showing the parapodial 
lamellae, imaged using fluorescence microscopy. i, Posterior region as shown 
in a, with putative pygidial cirri. j, Magnification of putative pygidial cirri in i. 
k, Same region as in j, imaged using fluorescence microscopy. plm, parapodial 
lamella; pyc, pygidial cirri. 


Extended Data Fig. 6 | Specimen YKLP 11384, a decayed specimen 
preserved in a dwelling tube. a, YKLP 11384a (the part), a whole specimen in a 
dwelling tube. The white arrowheads indicate the tube margin. b, YKLP 11384b 
(the counterpart), a whole specimen in a dwelling tube. The white arrowheads 
indicate the tube margin. c, Details of YKLP 11384b, showing putative blood 
lacuna and gut. The white and black arrowheads indicate the tube and body 
margins, respectively. d, Same view as in c, but imaged using fluorescence 
microscopy. Note the thick appearance of the tube margin. e, Detail of 
YKLP 11384a, showing the preservation of the tube wall, indicated by white 
arrowheads. 

The cortex organizes sensory information to enable discrimination and 
generalization. As systematic representations of chemical odour space have not yet 
been described in the olfactory cortex, it remains unclear how odour relationships are 
encoded to place chemically distinct but similar odours, such as lemon and orange, 
into perceptual categories, such as citrus. Here, by combining chemoinformatics 
and multiphoton imaging in the mouse, we show that both the piriform cortex and its 
sensory inputs from the olfactory bulb represent chemical odour relationships 
through correlated patterns of activity. However, cortical odour codes differ from 
those in the bulb: cortex more strongly clusters together representations for related 
odours, selectively rewrites pairwise odour relationships, and better matches odour 
perception. The bulb-to-cortex transformation depends on the associative network 
originating within the piriform cortex, and can be reshaped by passive odour 
experience. Thus, cortex actively builds a structured representation of chemical 
odour space that highlights odour relationships; this representation is similar across 
individuals but remains plastic, suggesting a means through which the olfactory 
system can assign related odour cues to common and yet personalized percepts. 


In olfaction, perception depends on chemistry. Chemically related 
odours evoke similar percepts within and across individuals, suggesting 
that the cortex harbours a conserved mapping from chemical to 
neural space that organizes information about odour relationships 
to ultimately support perception. Odours are detected by broadly 
tuned receptors expressed by olfactory sensory neurons, the axons of 
which project to the olfactory bulb (OB). Within the mouse OB, these 
axons are organized into thousands of discrete and spatially organized 
information channels known as glomeruli, each of which represents the 
tuning properties of an individual odour receptor. Odour information 
is reformatted by OB circuits before being transmitted to cortex; it is 
not clear whether or to what degree this peripheral transformation 
preserves information about odour chemical relationships. 

The main recipient of OB afferents is the piriform cortex (PCx); axons 
from OB projection neurons are broadly dispersed across the entire 
surface of the PCx, and individual PCx neurons respond to multiple, 
chemically distinct odorants. These observations suggest that 
neurons in the PCx randomly sample sensory inputs from the OB. 
Consistent with this possibility, individual odours activate ensembles 
of spatially distributed PCx neurons that lack apparent topographical 
organization with respect to chemical space. Feed-forward random 
network models (which posit stochastic connectivity between 
OB glomeruli and PCx neurons) predict that PCx odour representations 
should be pervasively decorrelated, but that PCx should maintain 
the pairwise odour relationships present in the OB; these models 
further suggest that cortical codes for odour relationships should be 
invariant across individuals, as peripheral representations of chemical 
relationships are largely determined by the tuning properties of odour 
receptors, which are encoded in the genome. 

However, in addition to receiving inputs from the OB, PCx neurons are 
linked through a dense web of excitatory interconnections, which 
suggests that the olfactory cortex acts as an auto-associative network. 
Such networks use Hebbian mechanisms to construct cell assemblies 
that encode information about stimulus relationships (such as feature 
similarity or temporal coincidence) through correlated activity. In 
the case of PCx, auto-associative mechanisms are predicted to both 
increase generalization across chemically similar odours, and to render 
cortical odour representations sensitive to passive odour experience, 
thereby reshaping pairwise odour relationships inherited from OB 
inputs. Although the PCx exhibits characteristics that are consistent 
with both random and auto-associative networks, it remains unclear 
whether the cortex systematically encodes information about odour 
chemical relationships; whether any such representation preserves or 
reshapes odour relational information conveyed by the OB; or whether 
cortical odour representations are primarily decorrelated (thereby 
favouring odour discrimination as predicted by random network mod- 
els) or correlated (thereby favouring odour generalization as predicted 
by auto-associative models). 


Cortex encodes odour chemical relationships 

To address these questions, we used multiphoton microscopy in mice 
expressing the fluorescent Ca²⁺ indicator GCaMP6s within the PCx to 
assess neural activity both in the input-dominated PCx layer 2 (L2), and 
in the more associational layer 3 (L3, in which odour responses have 
not yet been described) (Extended Data Fig. 1). We took advantage 
of a library of odour descriptors that quantifies thousands of 
physiochemical features, such as molecular weight, polarizability and 
hydrophobicity, to rationally design three sets of 22 odours each: a ‘global’ 
odour set, which included structurally diverse odorants well separated 
in odour space; a ‘clustered’ odour set divided into six odour subsets, 
each of which shared functional groups and other structural features; 
and a ‘tiled’ odour set, in which the carbon chain length of a ketone, 
an ester, an aldehyde and an acid was incrementally varied (Fig. 1a, 
Extended Data Fig. 1, Methods). Although each odour set captured 
progressively less chemical variance, by construction individual odours 
in the clustered set (within each of the six subsets) were most closely 
related, whereas odours were separated at intermediate distance scales 
in the tiled set. We noted that under anaesthesia odour responses in L3 
(and to a lesser extent L2) were attenuated or absent; recordings were 
therefore performed during wakefulness, a state in which L3 neurons 
were considerably more active (Extended Data Fig. 2, Methods). 

All odours evoked selective excitation and suppression, with PCx 
L3 responses being denser, broader and more reliable than those in 
L2 (Extended Data Fig. 3). Odours evoked more correlated activity 
across the population of PCx neurons (that is, ensemble correlations) 
than was expected by chance, with greater correlations observed in L3 
compared to L2 (Fig. 1b, Extended Data Fig. 3). These findings raised 
the possibility that correlated odour-evoked responses among PCx 
ensembles systematically reflect chemical relationships among odour 
stimuli. To explore this possibility, correlation distance matrices were 
generated for each odour set based on the physiochemical descriptors 
that characterize each odorant (Fig. 1c, Methods). Odours in the 
global set were the least chemically correlated with each other, whereas 
odours in the clustered odour set exhibited substantial block diagonal 
structure, consistent with subsets of odours sharing key chemical 
attributes. Because molecules in the tiled set are related along two 
chemical axes (for example, heptanone and octanone differ by one 
carbon atom, whereas heptanone and pentyl acetate differ by one 
oxygen atom), the matrix describing these odours exhibited periodic 
on- and off-diagonal structure. 

Fig. 1 | Systematically probing relationships between odour chemistry and 
cortical odour representations. a, Global, clustered and tiled odour sets (see 
Extended Data Fig. 1e for odour identities and structures), depicted in principal 
component space (see Methods). Colour indicates functional group associated 
with each odour. The amount of variance spanned by each odour set (of the full 
odour space, grey dots) is indicated. b, Example single neuron responses for 
the clustered odour set, representing the trial-averaged response of single 
neurons (rows) across 22 odours (columns). Rows are sorted using hierarchical 
clustering, with PCx L2 and L3 rasters sorted independently (Methods). 
c, Pairwise odour distances (Pearson’s correlation) for all odour sets based on 
chemical descriptors (Methods). Rows and columns represent individual 
odours sorted using hierarchical clustering (ordering as in Extended Data 
Fig. 1e). Colour bars indicate functional groups associated with each odour. 
d, Pairwise odour distances based on pooled neural population responses in PCx 
L2 and L3 (Methods), sorted as in c. Pearson’s correlation coefficient between 
the chemical and neural distance matrices reported below each matrix (global: 
P<107; clustered: P<10™; tiled: P< 10“); r, (shuffle) obtained by 
independently permuting odour labels for each neuron. Blue boxes highlight 
ketone-ester and ketone-acid relationships between chemistry and PCx L3. 
e, UMAP embeddings of cortical responses to the tiled odour set. Each dot 
represents a population response for one odour presentation (7 per odour), 
colour-coded as in d. f, Fraction of total variance in each mouse (L3 activity) 
attributable to shared across-mouse structure determined by distance 
covariance analysis (Methods). g, k-nearest-neighbour classification of odour 
identity in a held-out mouse using odour distances from other mice. Data are 
bootstrap mean ± s.e.m.; grey bars indicate shuffle control on odour labels 
(Methods). (Accuracy is greater in PCx. global: P<10°; clustered: P<10™; tiled: 
P<10™, two-sided Wilcoxon rank sum test.) Data in b, d–g are based on all 
responsive neurons (Methods) pooled by layer across mice (n mice, neurons 
(L2/L3) for global: 3, (854/616), clustered: 3, (867/488), tiled: 3, (427/334)) 
(see Methods for subject-specific statistics). 
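
The comparison described above, in which a correlation distance matrix built from chemical descriptors is set against one built from neural population responses, can be sketched as follows. This is a minimal illustration on simulated data, not the authors' pipeline: the helper names, the toy descriptor and response values, and the choice of 50 permutations are all assumptions. The shuffle control follows the Fig. 1 legend in permuting odour labels independently for each neuron.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_distances(rows):
    """Upper-triangle correlation distances (1 - r) between all pairs of rows."""
    return [1.0 - pearson(rows[i], rows[j])
            for i in range(len(rows)) for j in range(i + 1, len(rows))]


def shuffle_labels_per_unit(responses, rng):
    """Permute odour labels independently for each neuron (column)."""
    columns = [list(col) for col in zip(*responses)]
    for col in columns:
        rng.shuffle(col)
    return [list(row) for row in zip(*columns)]


# Hypothetical data: 12 odours, 8 chemical descriptors, 60 neurons whose
# responses are noisy random mixtures of the descriptors.
rng = random.Random(1)
n_odours, n_desc, n_neurons = 12, 8, 60
chem = [[rng.gauss(0, 1) for _ in range(n_desc)] for _ in range(n_odours)]
weights = [[rng.gauss(0, 1) for _ in range(n_neurons)] for _ in range(n_desc)]
neural = [[sum(c * w[k] for c, w in zip(odour, weights)) + rng.gauss(0, 0.5)
           for k in range(n_neurons)] for odour in chem]

chem_d = correlation_distances(chem)
r_real = pearson(chem_d, correlation_distances(neural))
r_shuffled = [pearson(chem_d, correlation_distances(shuffle_labels_per_unit(neural, rng)))
              for _ in range(50)]

print("chemistry vs neural r:", round(r_real, 2))
print("shuffle mean r:", round(sum(r_shuffled) / len(r_shuffled), 2))
```

Comparing only the upper triangles avoids counting each odour pair twice and excludes the trivial zero distances on the diagonal.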

Visual comparison and quantification demonstrated that odour 
chemistry and neural responses were only weakly related in the global 
odour set; by contrast, cortical odour responses maintained the block 
diagonal physiochemical correlation structure apparent in the clustered 
odour set, demonstrating that at close chemical distances, PCx 
represents odour chemical relationships (Fig. 1d). Notably, neural 
responses to the tiled odour set (in which odour relationships are 
organized at intermediate chemical distances) reflected on-diagonal 
chemical relationships, but did not uniformly encode off-diagonal 
relationships. For example, the cortex appeared to emphasize chemical 
similarities between ketones and esters, while de-emphasizing chemical 
similarities between ketones and acids (Fig. 1d, highlighted blue 
boxes). Structured chemical-neural relationships were apparent on a 
trial-by-trial basis, and persisted for several seconds after odour offset; 
as has been observed previously under anaesthesia, no spatial ordering 
of neurons was observed with respect to odour chemistry during 
wakefulness, consistent with response correlations alone conveying 
information about odour relationships (Extended Data Fig. 4). 

Both uniform manifold approximation and projection (UMAP) 
embeddings and manifold alignment revealed that cortical odour 
relationships were similar across mice (Fig. 1e, f); indeed, information 
about pairwise cortical odour distances derived from one mouse could 
be used to predict the identity of a held-out odorant based upon odour 
distances measured in a different mouse, with better performance 
observed in L3 than L2 (Fig. 1g, Methods). Lasso optimization was used 
to identify chemical features relevant to driving neural responses in 
each of the odour sets; identified descriptors captured physiochemical 
features such as molecular weight, electronegativity, polarizability and 
hydrophobicity, which suggests that ensemble-level odour representations 
are driven by diverse aspects of odour chemistry (Supplementary 
Table 1, Methods). Identified features that predicted neural activity 
for each odour set also improved the correspondence between all the 
other odour sets and their associated neural activity, demonstrating 
that information about odour chemistry gleaned from one experiment 
can be used to predict cortical responses in a different experiment 
carried out using a separate set of odorants (Extended Data Fig. 5a). 

Fig. 2 | Correlation structure differs in olfactory bulb and cortex. 
a, Correlation distance matrices for the tiled odour set across all conditions. Top 
left, distances obtained using chemical descriptors. Right, distances based on 
odour responses. Odour sorting as in Fig. 1c. r values indicate Pearson’s 
correlation with odour chemistry (Boutons: P< 10”; PCx L2: P< 10; PCx L3: 
P<10™°; Model: P<10; TeLC L2: P< 10'; TeLC L3: P< 10 *?; shuffled Pearson’s 
r = 0.0 ± 0.063 (mean ± s.d.), 1,000 permutations on odour label). ED, effective 
dimensionality (Methods). b, Left, difference between PCx and bouton 
distances in a. Right, difference between PCx and random network model 
distances in a (Methods). c, Pairwise odour correlation distances based on 
neural responses plotted against corresponding chemical distances. 
d, Silhouette scores for clustered population responses (based upon Euclidean 
distances and grouped via k-means clustering) over a range of cluster sizes. 
Higher values indicate better clustering (Methods). e, Left, pairwise odour 
correlations in boutons and PCx predicted by the feed-forward random 
network model (Methods) compared to observed correlations in PCx L2 and L3. 
Right, probability density distribution of differences between cortical (PCx L2 
and L3) and input (boutons) pairwise odour correlations, superimposed on the 
distribution expected with the model (model versus L3: P<10™?, versus L2: 
P<10™, Kolmogorov-Smirnov test). f, Difference in pairwise odour 
correlations between PCx L3 and boutons (grey dots). Positive values indicate 
greater correlation in the cortex. Odour pairs are ranked along the x axis from 
least to highest correlation in the bouton data. Short-chain (SC) and long-chain 
(LC) comparisons between ketones (K), esters (E) and aldehydes (A) are 
colour-coded as shown. 
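
The held-out prediction described above (using odour distances from one mouse to identify an odorant from its distances in another mouse) can be sketched as a nearest-neighbour match on distance profiles. The exact procedure is given in the paper's Methods and is not reproduced here; the version below is an assumed simplification with k = 1, hypothetical function names and simulated "mice" that share one latent odour geometry.

```python
import random
from math import dist


def pairwise_distances(points):
    """Full symmetric matrix of Euclidean distances between points."""
    return [[dist(p, q) for q in points] for p in points]


def predict_held_out(ref_dist, test_dist, held_out):
    """Guess the identity of `held_out` from its distances to the other odours.

    The held-out odour's distance profile in the test mouse is compared with
    every candidate odour's profile in the reference mouse; the candidate with
    the smallest squared mismatch wins (nearest neighbour, k = 1).
    """
    others = [o for o in range(len(ref_dist)) if o != held_out]
    query = [test_dist[held_out][o] for o in others]

    def mismatch(candidate):
        profile = [ref_dist[candidate][o] for o in others]
        return sum((a - b) ** 2 for a, b in zip(query, profile))

    return min(range(len(ref_dist)), key=mismatch)


# Hypothetical data: 10 odours in a shared 3-D latent space; each simulated
# mouse sees the same geometry plus a small independent jitter.
rng = random.Random(2)
latent = [[rng.uniform(0, 1) for _ in range(3)] for _ in range(10)]


def noisy_mouse(noise):
    jittered = [[x + rng.gauss(0, noise) for x in p] for p in latent]
    return pairwise_distances(jittered)


mouse_a, mouse_b = noisy_mouse(0.01), noisy_mouse(0.01)
hits = sum(predict_held_out(mouse_a, mouse_b, o) == o for o in range(10))
accuracy = hits / 10
print("held-out accuracy:", accuracy, "(chance = 0.1)")
```

Accuracy above chance in such a scheme requires that the two distance matrices share structure, which is the property the across-mouse analysis is testing.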


Cortical odour representations reshape bulb inputs 

The selective differences between odour chemical relationships and 
cortical activity apparent in the tiled odour experiment could reflect 
correlation structure present in OB inputs to PCx (consistent with 
feed-forward random network models), or instead could be gener- 
ated by cortex (consistent with auto-associative models). However, 
until now it has not been possible to quantify odour-evoked responses 
across the complete array of OB glomeruli, which has prevented the 
characterization of correlation structure in bulb inputs to PCx. To 
address this challenge, we introduced synaptically targeted GCaMP6s 
into projection neurons spanning the OB, and imaged odour-evoked 
activity in boutons in PCx layer 1a (L1a), where they synapse with L2 
and L3 neurons; because the axons and boutons of all OB glomeruli 
are spatially distributed across the PCx, each cortical field of view 
effectively samples glomeruli from the entire bulb (Methods, Extended 
Data Fig. 6). 

Odours from the tiled odour set evoked both excitation and sup- 
pression in OB boutons, the responses of which were similar across 
mice (Extended Data Fig. 7). Correlation distance matrices revealed 
that bouton responses reflected information about odour chemical 
relationships (Fig. 2a); in addition, identification of physiochemical 
features that optimized the observed chemical-bouton relationships 
improved predictions of bouton responses to held-out odours as well 
as predictions of cortical responses to the tiled odour set (Extended 
Data Fig. 5b, Supplementary Table 1). Thus, similar to the cortex, OB 
projection neuron boutons encode information about odour relation- 
ships and chemistry. 

However, odour responses in boutons and cortex exhibited distinct 
patterns of correlation with respect to chemistry, with the greatest 
chemical-neural differences observed in PCx L3 (Fig. 2a, b). Although 
the average level of correlated activity was similar in boutons and 
cortex, the distribution of odour-evoked correlations differed, with 
bouton representations exhibiting higher effective dimensionality 
(see Methods); by contrast, odour responses of PCx L3 neurons were 
more clustered, more selectively structured, and exhibited both lower 
effective dimensionality and a wider dynamic range for representing 
close chemical relationships (Fig. 2a, c-e, Extended Data Fig. 7). The 
presence of these structured correlations in part reflected increased 
grouping of closely related odorants, as representations for odours 
nearest each other in chemical space (that is, on-diagonal correlation 
matrix relationships) were more clustered in the cortex than in boutons 
(Fig. 2a, f). One exception to this trend was acids, which as a class 
were correlated in the OB but relatively decorrelated in the cortex 
(Fig. 2a). 
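
The clustering comparison above rests on silhouette scores (Fig. 2d), which can be sketched as follows. This is a minimal illustration of the standard silhouette coefficient with Euclidean distances. In the paper the cluster labels come from k-means; here they are supplied by hand, and the function name and toy points are assumptions.

```python
from math import dist


def silhouette_score(points, labels):
    """Mean silhouette coefficient using Euclidean distances."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:  # singleton clusters score 0 by convention
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical population responses: two tight, well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
tight = silhouette_score(points, [0, 0, 0, 1, 1, 1])
mixed = silhouette_score(points, [0, 1, 0, 1, 0, 1])
print("correct grouping:", round(tight, 2))
print("scrambled grouping:", round(mixed, 2))
```

Scores near 1 indicate compact, well-separated clusters; scores near or below 0 indicate that points sit as close to another cluster as to their own.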

Odour relationships were also reshaped in the cortex compared to 
those in odour chemistry and boutons. UMAP embeddings of data 
from the tiled odour experiment (in which chain length and functional 
group are the main axes of chemical variation) suggested that boutons 
largely organize odour information along a single axis that empha- 
sizes chain length (again, with the exception of acids) (Methods); by 
contrast, odour information in PCx L3 appeared largely organized in 
two dimensions based on functional group (Fig. 3a). Similar functional 
group-based reorganization was observed via hierarchical cluster- 
ing (Fig. 3b). Lasso optimization confirmed that boutons and cortex 
differentially weight chemical features related to chain length and 
functional group (Extended Data Fig. 5c). 

Moreover, several pairwise odour relationships were reorganized in 
PCx on the basis of both chain length and functional group. For example, 
in chemical space, short-chain and long-chain odours with different 
functional groups were similarly cross-correlated; in boutons, cor- 
relations between short-chain aldehydes and esters were emphasized 
whereas those among long-chains were diminished; and in PCx L3, the 
opposite pattern was observed, with long-chain aldehydes and esters 
exhibiting stronger correlations and short-chains exhibiting weaker 
correlations (Fig. 3c). Chain-length-dependent cortical reshaping of 
odour relationships was also apparent between aldehydes and ketones. 

These differences in correlation structure suggest that PCx and OB 
boutons differentially encode information about odour identity and 
odour relationships. Linear decoders based on cortical responses 
(particularly from PCx L3) were worse than OB-based decoders at 
predicting odour identity on each trial, consistent with bouton odour 
responses having a higher dimensionality (Fig. 4a). By contrast, cortex 
(particularly PCx L3) was on average better at encoding information 
about odour relationships (Fig. 4b); notably, however, OB was better at 
generalizing across short-chain ketones, aldehydes and esters whereas 
cortex was better at generalizing across the corresponding long-chains, 
consistent with the observed differences in correlation structure for 
these odour classes (Figs. 3c, 4c). 

Fig. 3 | Cortical odour responses reformat odour relationships inherited 
from the OB. a, UMAP embeddings for all experimental conditions. Note that 
UMAP emphasizes relationships rather than distances, so these embeddings 
are similarly scaled; without such scaling the points in the model panel, for 
example, would be much more separated than those in the PCx L3 panel 
(Fig. 2a, Methods). b, Hierarchical clustering of neural population responses; 
ρ values indicate clustering similarity to OB boutons (Spearman correlation on 
cophenetic distances between boutons and the other datasets). c, Enlarged 
regions from correlation matrices in Fig. 2a depicting conserved and 
rearranged odour relationships between aldehydes, ketones and esters; inset: 
ratio of correlations between long-chain and short-chain comparisons (each 
dot indicates mean across odour pairs); r values indicate Pearson's correlation 
to odour chemistry. Colour code in a and b is as in c. 

Given these differences in information content, we assessed whether 
bulbar or cortical odour codes more closely correspond to perceptual 
odour relationships by measuring the innate perceptual similarity of 
odour pairs through a cross-habituation assay (Extended Data Fig. 8). 
Perceptual odour relationships better matched odour responses in PCx 
L3 than those in OB or PCx L2 (Fig. 4d); this closer correspondence to 
PCx L3 was particularly apparent for the short-chain-short-chain and 
long-chain-long-chain comparisons, whose pattern of neural correla- 
tion was inverted in bulb and cortex (Fig. 3c). 
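
The match between neural and perceptual odour relationships is summarized in the Fig. 4d legend by a Spearman correlation coefficient. A minimal sketch of that statistic is given below. The distance values are invented for illustration and the function names are assumptions; only the rank-correlation formula itself is standard.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of 1-based positions start..end
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    return cov / sqrt(sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry))


# Hypothetical pairwise distances for five odour pairs.
neural_distance = [0.2, 0.4, 0.5, 0.9, 0.7]
behavioural_distance = [0.1, 0.3, 0.35, 0.8, 0.9]
rho = spearman_rho(neural_distance, behavioural_distance)
print(round(rho, 3))  # prints 0.9
```

Because only ranks enter the calculation, the statistic is insensitive to any monotonic rescaling of either distance measure.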


OB-PCx transformation requires associative network 


Together, our observations suggest that the transformation 
between bulb and cortex reflects the combined influence of random 
network-type connectivity (which maintains odour relationships) and 
an auto-associative mechanism (which generally clusters and can selec- 
tively reshape odour relationships). To directly assess the contribution 
of random network-type connectivity to the observed cortical odour 
responses, bouton odour responses were passed through a previously 
established feed-forward model in which simulated PCx neurons 
stochastically sample from multiple glomerular inputs (Methods). 
Consistent with previous reports, the model predicted decorrelated cortical 
odour representations, whose pairwise relationships were preserved 
relative to boutons (Fig. 2a). Although cortical responses were in part 
consistent with model output—as many pairwise odour relationships 
were preserved—the model failed to capture the strong correlation 
structure present in cortex or the selective rewriting of pairwise odour 
relationships (Figs. 2a–e, 3). 

Fig. 4 | Cortical odour representations generalize across odours, are 
consistent with perception, and can be modified by experience. a, Left, 
schematic depicting a linear support vector machine (SVM) classifier trained 
to identify an odour associated with a held-out neural population response on a 
trial-by-trial basis. Right, decoding accuracy plotted against neural/bouton 
populations of different sizes. b, Decoding analysis to quantify odour 
generalization; each line represents classifier confusion between any odour 
and all other odours, rank ordered by the degree of confusion. c, Decoding 
accuracy of SVM classifiers predicting whether a held-out odour is a 
short-chain or long-chain molecule. The acid block was excluded for this 
analysis. Data are bootstrapped mean ± s.e.m. across held-out odours and 
neural/bouton ensembles. (In a–c: tiled odour set, 22 odours; number of mice, 
neurons/boutons for PCx L2/L3 as in Fig. 1b, d–g; for boutons, 6 mice/3,160 
boutons. In b, c: 300 units, 100 bootstraps. See Methods for all decoding 
analyses.) d, Left, pairwise neural and behavioural odour distances from a 
cross-habituation assay for the tiled odour set (Extended Data Fig. 8); ρ is 
Spearman correlation coefficient. Black line indicates regression fit (mean ± 
95% confidence interval, 1,000 bootstraps). Black circles are mean ± s.e.m. 
across mice (n>3 for each comparison). Right, coefficient of determination 
(R²) based on short-chain-short-chain and long-chain-long-chain or all 
comparisons (median ± 66th confidence interval, 1,000 bootstraps. n=26 
odour triplets; 122 mice across all conditions, see Methods for behavioural 
distance and odour identities). e, Left, probability density estimates of 
cell-wise class preference index for naive and passive odour exposure 
conditions, for neurons responding to at least one short-chain ketone or 
aldehyde (Methods). Right, example z-scored fluorescence (and preference 
index) from neurons tuned to short-chain ketones (cell 1), short-chain 
aldehydes (cell 2), or both (cell 3). Grey bars indicate odour onset. f, Pairwise 
odour distances in PCx L3 from odour-naive and odour-exposed mice. Passive 
exposure to the target mixture (short-chain ketones and aldehydes) 
specifically increased similarity between ketones and aldehydes, but not 
between control odour pairs (short-chain ketones vs long-chain aldehydes and 
short-chain esters vs short-chain acids (Ac)) (Extended Data Fig. 10). *P< 0.002; 
NS (not significant) middle: P= 0.62; right: P= 0.45, two-tailed independent 
t-test. Number of mice/neurons for naive: 3/334; exposed: 3/742. 
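
The feed-forward idea tested above, in which simulated cortical neurons stochastically sample glomerular inputs, can be sketched in a few lines. This is not the previously established model used in the paper: the uniform unit weights, the per-cell threshold set at the mean drive, the parameter values and the function name are all assumptions chosen to keep the illustration small.

```python
import random


def random_feedforward_layer(inputs, n_cortical, n_inputs_per_cell, seed=0):
    """Project odour-by-glomerulus responses onto simulated cortical cells.

    Each simulated cell sums a random subset of glomeruli (unit weights) and
    is rectified at its own mean drive across odours, so it responds only to
    the odours that drive it more strongly than average.
    """
    rng = random.Random(seed)
    n_glomeruli = len(inputs[0])
    output = [[0.0] * n_cortical for _ in inputs]
    for cell in range(n_cortical):
        sampled = rng.sample(range(n_glomeruli), n_inputs_per_cell)
        drive = [sum(odour[g] for g in sampled) for odour in inputs]
        threshold = sum(drive) / len(drive)
        for o, d in enumerate(drive):
            output[o][cell] = max(0.0, d - threshold)
    return output


# Hypothetical bouton data: 6 odours by 40 glomeruli with non-negative responses.
data_rng = random.Random(3)
boutons = [[data_rng.random() for _ in range(40)] for _ in range(6)]

cortex = random_feedforward_layer(boutons, n_cortical=25, n_inputs_per_cell=8, seed=4)
active = sum(v > 0 for row in cortex for v in row) / (6 * 25)
print("fraction of odour-cell pairs active:", round(active, 2))
```

Because connectivity is random and carries no learned structure, any odour relationships in the output of such a layer are inherited from the input, which is why the comparison with observed cortical correlations is informative.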

To evaluate the relative influence of auto-associative mechanisms 
on cortical odour representations, we used an adeno-associated virus 
(AAV)-based method to express tetanus toxin light chain (TeLC) within 
PCx neurons; this approach blocks synaptic transmission and causes 
PCx to behave as if it largely receives feed-forward inputs (Extended 
Data Fig. 9). After attenuation of the associative network, the tuning of 
single neurons to odours broadened, response densities rose, and odour 
correlations increased, consistent with known network-dependent 
recruitment of inhibition (Extended Data Fig. 9). Crucially, after TeLC 
infection, cortical odour relationships more closely resembled those 
present in odour chemistry and OB axonal boutons as assessed via 
correlation matrices, UMAP clustering and hierarchical clustering; 
for example, the cortical restructuring of short-chain and long-chain 
odour relationships was abolished, as was the decorrelation among 
acids (Figs. 2a–c, 3). 

Auto-associative networks are predicted to influence correlations 
among odour representations to reflect the coincidence of stimuli 
in the world; although reward-based experiments have revealed 
task-dependent changes in cortical odour relationships, it has not 
yet been demonstrated that cortical odour correlations are sensitive to 
passive odour experience. We therefore repeatedly exposed mice to 
a mixture of short-chain aldehydes and ketones (the PCx L3 representations 
of which are relatively decorrelated) (Fig. 2a) before assessing 
cortical responses to the tiled odour set. Mixture experience specifi- 
cally increased the cortical correlation between individual aldehydes 
and ketones, and recruited single neuron tuning curves that reflected 
generalized responses to these specific odour classes (Fig. 4e, f, 
Extended Data Fig. 10). These observations demonstrate that corti- 
cal odour relationships can adapt to the statistics of the experienced 
odour environment. 


Discussion 


The olfactory system must synthesize information about chemical 
features to generate organized odour representations that support 
discrimination and generalization. Here we show that both OB bou- 
tons and cortex explicitly represent odour chemical relationships. 
The observation that many pairwise odour relationships are encoded 
similarly in these two brain areas is consistent with random connectiv- 
ity models, which propose that PCx neurons stochastically sample 
glomeruli to generate a systematic population-level representation 
of odour chemical space. 

However, cortex differs from the bulb in two key respects. First, 
PCx better clusters odour representations, enabling it to prefer- 
entially signify odour relationships. Second, cortex reconfigures 
information about odour relationships inherited from the bulb—the 
cortex does not simply pool and normalize its inputs, but instead, 
in a network-dependent manner, actively builds an odour space to 
emphasize certain odour relationships and de-emphasize others; 
this re-writing is sensitive to odour exposure, which can recruit new 
single neuron tuning properties and modify odour relationships. 
The olfactory system, therefore, transforms a chemical feature space 
into a cortical space that represents stimulus relationships through 
correlated activity; the structure of this space reflects information 
inherited from the sensory periphery, the transformation imposed 
by cortical circuits, and the effects of sensory experience. The corti- 
cal grouping of representations for both structurally and temporally 
related odours suggests a mechanism for generalization across natural 
odour sources, which tend to emit related odour chemicals; in prin- 
ciple, similar mechanisms could assign coincidentally encountered 
but structurally distinct odours to shared semantic categories. 
Future in vivo experiments will be required to understand how the 
intrinsic properties of PCx neurons and the associative network, which 
targets both excitatory pyramidal and inhibitory neurons, collaborate 
to transform and organize odour representations. 

In nearly all our analyses, the correlation structure of L2 odour 
representations was intermediate between that observed in boutons 
and L3, which may reflect relative differences in the prominence of 
bulb inputs in L2 and associational connectivity in L3. Because the 
network that interconnects PCx neurons also sends centrifugal projec- 
tions to the OB, it is likely that under physiological circumstances this 
network influences both bulb and cortical representations of odour 
relationships. Although the TeLC experiment demonstrates 
that network activity originating in PCx is required for the bulb-cortex 
transformation, PCx is recurrently connected to an array of higher 
olfactory centres that may also have a role in shaping odour relationships. 
Notably, neural representations in PCx L3 more closely match 
perception than those present in bulbar inputs, suggesting a functional 
hierarchy among at least some interconnected olfactory brain areas. 

Relationships among cortical odour representations depend on 
chemical distances, such that at close distances information about 
chemical relationships is largely maintained, at large distances cortex 
decorrelates odour representations, and at the intermediate distances 
captured by the tiled odour set the olfactory system sculpts relational 
representations for odours in a manner that respects but reshapes 
chemical relationships. Our findings with the clustered odour set are 
reminiscent of previous work demonstrating that similar odour mixtures 
recruit overlapping ensembles of PCx neurons, although in those 
experiments chemical distances were not quantified. Although here we 
take advantage of functional groups and chain lengths to systematically 
alter odour distances at intermediate scales, many distinct chemical 
features differentially contribute to odour representations in both the 
bulb and cortex. The finding that treating odour chemicals as buckets 
of physiochemical features—rather than organizing information about 
chemistry along arbitrary dimensions—identifies structured chemical- 
neural-perceptual relationships is consistent with the longstanding 
model that the odour receptor repertoire broadly samples chemical 
feature space. 

The relational information present in PCx cannot, in and of itself, 
assign a given odorant to its unique odour quality: the mapping 
observed here potentially explains why lemon characteristically smells 
similar to orange, but fails to explain why lemon smells like lemon. In 
particular, it is unclear how cortical information about odour relationships 
might be aligned to enable lemon odour to evoke a similar percept 
across individuals. We propose that relational information in PCx (and 
possibly other olfactory areas) is translated into invariant information 
about odour quality by using universal points of reference, much like a 
compass can be used to orient a paper map to the cardinal directions. 
These points of reference may arise from invariant properties of specific 
odour receptors, or from hardwired circuits in olfactory areas such as 
the accessory olfactory nucleus or the cortical amygdala. Alternatively, 
reference points could be learned from shared experience; 
in this model, exposure to stereotyped odours (for example, amniotic 
fluid, mother’s milk, faeces, urine or food) or common objects (such as 
actual lemons) could orient chemical-neural mappings along (largely) 
invariant axes across individuals. Further work aimed at understand- 
ing the interaction between fixed and flexible features of olfactory 
circuitry will be required for a full account of the relationship between 
chemistry, experience and perception. 


Methods 


Ethical compliance 
All experimental procedures were approved by the Harvard Medical 
School Institutional Animal Care and Use Committee (protocol number 
04930) and were performed in compliance with the ethical regulations 
of Harvard University as well as the Guide for Animal Care and Use of 
Laboratory Animals. 


Mice 

Acute imaging of PCx was performed in 8–16-week-old C57BL/6J 
(Jackson Laboratories) male mice. Imaging of cortical neurons was 
performed in mice harbouring the Vgat-ires-cre knock-in allele (Jackson 
Stock no. 028862) and the ROSA26-LSL-TdTomato cre reporter allele 
(Jackson Stock no. 007914); imaging of boutons was performed in mice 
containing the Tbx21-cre allele (Jackson Stock no. 024507, gift from 
C. Dulac). TeLC-dependent elimination of cortical excitatory transmis- 
sion and subsequent imaging was performed in Emx1-IRES-Cre mice 
(Jackson Stock no. 005628). Male mice were group-housed before viral 
delivery of GCaMP6s and housed singly for 1–3 weeks after injection. 


Viral constructs 

To generate pAAV-hSyn-FLEX-TeLC-P2A-dTom, pAAV-hSyn-FLEX- 
TeLC-P2A-EYFP (a gift from Bernardo Sabatini) was digested with 
AscI and NheI to remove enhanced yellow fluorescent protein (eYFP). 
A gene fragment (synthesized by IDT) containing dTomato with the SV40 
nuclear localization signal was cloned into the TeLC backbone via isothermal 
assembly (NEB HIFI E2621). AAVDJ-hSyn-FLEX-TeLC-P2A-dTom 
virus was produced by Vigene Biosciences, with a titre of 1.5 x 10" 
genome copies per ml. AAV PHP.eB hSynapsin1-FLEx-axon-GCaMP6s 
virus was produced as previously described. In brief, HEK293T cells 
were co-transfected with pAAV-hSynapsin1-FLEx-axon-GCaMP6s (a gift 
from L. Tian, Addgene plasmid 112010), PHP.eB rep-cap (a gift from the 
Viviana Gradinaru Lab/Clover centre), and adenovirus helper plasmids. 
After 5 days, viral particles were collected and then purified via iodix- 
anol gradient ultracentrifugation. The virus was titred via quantitative 
PCR (qPCR); the titre of all batches was between 2 x 10-3 x 10" viral 
genomes per ml. 


Stereotaxic viral delivery 

Vgat-ires-cre; ROSA26-LSL-TdTomato male mice were injected with an 
AAV expressing the genetically encoded activity indicator GCaMP6s 
(AAV1.CAG.GCaMP6s.WPRE.SV40, Penn Vector Core). Injections were 
targeted to posterior PCx using Allen Brain Atlas coordinates: medial-lateral 
(ML): −4.2, anterior-posterior (AP): −1.09, dorsal-ventral (DV): 
−4.25, from the dura. FLEX-TeLC-P2A-dTom was targeted to anterior 
and posterior PCx of both hemispheres of Emx1-IRES-Cre mice at AP: 
0.39 and −1.0; ML: −3.51 and −4.2; DV: −4.4 and −4.1. To uniformly infect 
olfactory bulb projection neurons, AAV PHP.eB FLEX-axon-GCaMP was 
delivered intravenously via retro-orbital injections in Tbx21-Cre trans- 
genic mice (Jackson Stock no. 024507), which express Cre recombinase 
in OB projection neurons. 

Full-titre viruses (500-1,000 nl) were delivered to cortex at nls“ using 
a Nanoject II dispensing pump (Drummond Scientific). GCaMP6s-injected 
mice were imaged 1–3 weeks after delivery. In TeLC experiments, GCaMP6s 
was injected at ML: −4.2, AP: −1.09, DV: −4.25, from the dura, 2–3 weeks 
after TeLC delivery. Mice were imaged 3–5 weeks after TeLC delivery. 
Imaging of OB projection neuron axon terminals in L1a of PCx was performed 
3–5 weeks after retro-orbital injection. To assess the influence of 
passive odour exposure on cortical representations (see ‘Passive odour 
exposure’), odour exposure to mixtures was initiated 1-2 weeks after viral 
delivery; as with imaging in odour-naive mice, imaging in odour-exposed 
mice was performed 3-4 weeks after injection. 

Uniform infection of L2 and L3 cortical neurons by AAV-TeLC across 
the rostro-caudal extent of PCx was confirmed histologically. To 
validate TeLC-dependent inhibition of cortical excitatory synaptic 
transmission, odour-evoked single-unit activity was compared between 
control and infected mice as previously described. 


Surgical approach and craniotomy 

We developed a surgical preparation compatible with PCx imaging 
during wakefulness and semi-paralysis; this preparation is similar to 
those used in the past to explore in vivo neural responses without the 
use of general anaesthesia (but with effective analgesia) during the 
experiment. Before exposing PCx, mice were anaesthetized with 
isoflurane, and head-fixed with dental cement to a rotating headpost. 
The PCx was then accessed from the ventral surface of the mouse skull 
through surgical resection of the zygoma, the mandible, and associated 
musculature and fascia. A 2-mm craniotomy overlying the PCx was made 
using a dental drill and secured with a custom-shaped cranial window. 

To ensure that the mouse was free of pain and discomfort during 
wakefulness and semi-paralysis, full hemifacial analgesia was provided 
by performing a complete trigeminal nerve block. This procedure is 
designed to abolish sensation around the surgical exposure, as well as 
the ipsilateral oro-facial region encompassing the entire dorsoventral 
extent of the head and extending from the nostril to the neck. The junc- 
tion of the four branches of the trigeminal ganglion was readily identi- 
fied at the external pterygoid ridge, which was rendered accessible 
when the mandible was removed. A 1–5 µl, 0.2–1.0 mg kg⁻¹ dose of bupivacaine 
was injected directly into the stalk of the trigeminal nerve bundle 
using a calibrated micropipette mounted on a micro-manipulator. By 
infusing bupivacaine solution proximally to the trigeminal ganglion, 
distribution along all trigeminal branches, including the mandibular, 
ophthalmic, and infraorbital branches, was ensured. To verify that 
the block infiltrated the entire nerve bundle, each injection was supplemented 
with the fluorescent lipophilic contrast agent DiI, which is used to 
identify myelinated nerve fibres, owing to its lipophilic, infiltrating 
nature. By including DiI in the block solution and monitoring its 
diffusion through the nerve adjacent to the injection site, proper micropipette 
placement directly inside the nerve bundle was confirmed. 
Successful DiI injections were characterized by uniform distribution 
of dye through the trigeminal bundle proximal to the injection site. 
In cases of insufficient labelling of the nerve bundle, several injections 
were administered until the entire nerve was labelled by visual inspection. 
This procedure was extensively evaluated through measurements 
of heart rate (which revealed no signs of distress) (Extended Data 
Fig. 2) and by systematically probing the depth of analgesia through the 
use of needle pricks along the entire dorso-ventral and rostro-caudal 
portion of the head ipsilateral to the injection site. 

After completion of the surgical exposure, induction of analgesia, 
installation of paralytic infusion and retro-nasal sniffing lines, as well 
as placement of the EEG electrode (see ‘Semi-paralysis’, ‘Artificial sniff- 
ing’ and ‘EEG for assessing brain state’ sections), isoflurane anaesthesia 
was discontinued and mice were transferred to the imaging set-up 
equipped with custom-built sniff generator, oxygen respirator, as well 
as a peristaltic pump for paralytic infusion. 


Semi-paralysis 

Mice were provided with a continuous infusion of a low dose of the muscle 
relaxant pancuronium bromide into the jugular vein during the imaging 
phase of the experiment. After calibration, the final dose and infusion 
speed were chosen to be 0.024 ng kg⁻¹ per 10 min. At this dose, mice experience 
a loss of righting reflex, but maintain diaphragmatic contractions and 
toe-pinch reflexes. Because this dose was chosen so as to minimize paralysis 
(which is not required for analgesia), if movement was observed in the 
experiment, intermittent pushes of pancuronium were provided to ensure 
motion-free imaging. We refer to this preparation as being in the condition 
of ‘wakefulness’ (see ‘EEG for assessing brain state’) as opposed to ‘awake’ 
given that the mice are incapable of gross movements and cannot actively 
sample odours because of the ventilator (see ‘Artificial sniffing’). 


Artificial sniffing 

To control for potential differences in odour coding due to changes 
in odour sampling, the mouse’s sniffing rhythm was replaced with an 
8-Hz fixed inspiration/expiration cycle synchronized in time to odour 
presentation; this cycle rate mimics known sniffing rates during active 
odour sampling. We adopted a previously developed method in 
which a cannula was placed into the nasopharynx via the trachea and 
subsequently attached to a solenoid valve which draws air bidirectionally 
across the nasal epithelium. The tube was secured to the trachea 
with a pair of nylon sutures and doused with silicone elastomer for 
further stability. The distal portion of the tube was then coupled to 
a computer-driven solenoid valve and a vacuum line, providing 50-ms 
pulses of suction every 75 ms at a flow rate of 100 ml min⁻¹. 


EEG for assessing brain state 

Anaesthesia is thought to induce a brain-state similar to slow wave sleep 
that is characterized by large-amplitude fluctuations in the 0.5–4.0 
Hz range and the absence of high-frequency activity from 40–100 Hz, 
which is typically present during wakefulness or behavioural engage- 
ment. Because power in the slow and fast frequency bands of the 
EEG is anti-correlated across these brain states, their ratio has been 
traditionally used to assign an absolute value to the arousal state of 
the mouse. To compare changes in brain state between wakefulness 
and anaesthesia and associated changes in odour representation, 
after completion of awake imaging, some mice were subsequently 
anaesthetized with a ketamine dose of 50 mg kg⁻¹ and 0.5 mg kg⁻¹ of 
medetomidine delivered together intraperitoneally, and the imaging 
session was immediately repeated. For all experiments, EEG activity 
was recorded using a silver wire inserted into the dorsal anterior PCx 
via a 0.5-mm craniotomy. A grounding wire was placed into the con- 
tralateral cerebellum. This signal was amplified using an AM Systems 
1800 amplifier and digitized with a National Instruments PXIe-6341 
acquisition card. Signals were detrended and bandpassed (0.5-500 
Hz) before computing the EEG power ratio. 
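
As a rough illustration of the EEG power ratio, the sketch below compares mean periodogram power in the slow (0.5–4 Hz) and fast (40–100 Hz) bands named above. The sampling rate, the linear detrend, the use of a plain FFT periodogram and the direction of the ratio (fast over slow) are all assumptions made for this example; the text does not specify them, and the function names are hypothetical.

```python
import numpy as np

def band_power(signal, fs, f_lo, f_hi):
    """Mean periodogram power of `signal` between f_lo and f_hi (Hz)."""
    x = np.asarray(signal, dtype=float)
    # Remove a least-squares linear trend before the FFT (assumed detrend).
    t = np.arange(x.size)
    slope, intercept = np.polyfit(t, x, 1)
    x = x - (slope * t + intercept)
    freqs = np.fft.rfftfreq(x.size, d=1.0 / fs)
    power = np.abs(np.fft.rfft(x)) ** 2
    mask = (freqs >= f_lo) & (freqs <= f_hi)
    return power[mask].mean()

def eeg_power_ratio(signal, fs, slow=(0.5, 4.0), fast=(40.0, 100.0)):
    """Fast-band over slow-band power (assumed direction of the ratio)."""
    return band_power(signal, fs, *fast) / band_power(signal, fs, *slow)

# Synthetic check: a trace dominated by 2-Hz activity gives a small ratio.
fs = 1000.0                      # assumed sampling rate in Hz
t = np.arange(4000) / fs
sleep_like = 5 * np.sin(2 * np.pi * 2 * t) + 0.5 * np.sin(2 * np.pi * 60 * t)
print(eeg_power_ratio(sleep_like, fs) < 1.0)
```

With this convention, larger values would indicate a more wakeful-like state; inverting the ratio simply reverses the scale.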


Odour space design 

A major goal of this study was to rationally design odour sets such 
that chemical similarities and differences between odorants in each 
odour panel could be explicitly titrated. As described previously, to 
describe odour space we took advantage of 2,584 molecules commonly 
used in the flavours and perfume industries from 
http://www.thegoodscentscompany.com/. A large fraction of these molecules 
are odorous, in that they are less than 300 Da in size and sufficiently 
hydrophilic and volatile to readily access the environment of the nasal 
epithelium. This collection contains structurally diverse molecules that 
vary in carbon chain length, weight, polarizability, hydrophobicity, 
cyclicity, branching, constituent functional groups, and other chemical 
attributes. To characterize the physiochemical features of the odours 
within this large odour collection, we took advantage of a database of 
3,705 statistical metrics designed to quantify different molecular physi- 
ochemical properties (Dragon, KODE Inc.), including those related to 
molecular weight, volume, ionization potential, and so on. Using this 
descriptor database, each of the 2,584 collected molecules was represented 
by a vector containing 3,705 values (in which each value is a 
quantitative description of a specific physiochemical feature), thereby 
constituting an odour chemical space where odour similarity can be 
expressed as the Euclidean, cosine or correlation distance between any 
two molecules. Of these features, 2,522 are quantified in the Dragon 
database as continuous variables (for example, molecular weight) 
and 985 were binarized or categorical (for example, the presence or 
absence of an N atom). The collection of these inter-odour distances 
(which holistically capture the quantified physiochemical differences 
between each odour pair) can be converted into correlation distance 
matrices (see ‘Chemical and activity distance’). 
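
The three distance measures mentioned above can be sketched for a small descriptor matrix as follows. This is a minimal illustration, not the pipeline used in the study: the function name is hypothetical, the toy matrix stands in for the 2,584 × 3,705 descriptor table, and any rescaling of descriptors before computing distances is left out because the text does not describe it.

```python
import numpy as np

def pairwise_distances(features, metric="correlation"):
    """Pairwise distances between rows of `features` (molecules x descriptors)."""
    X = np.asarray(features, dtype=float)
    if metric == "euclidean":
        # Broadcasting is memory-hungry; suitable for small panels only.
        diff = X[:, None, :] - X[None, :, :]
        return np.sqrt((diff ** 2).sum(axis=-1))
    if metric == "correlation":
        X = X - X.mean(axis=1, keepdims=True)  # centre each molecule's vector
    elif metric != "cosine":
        raise ValueError(f"unknown metric: {metric}")
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    return 1.0 - unit @ unit.T                 # 1 - cosine (or Pearson) similarity

# Toy descriptor table: 3 molecules x 3 descriptors (made-up values).
toy = np.array([[1.0, 2.0, 3.0],
                [2.0, 4.0, 6.0],
                [3.0, 2.0, 1.0]])
print(np.round(pairwise_distances(toy, "correlation"), 3))
```

Correlation distance is cosine distance applied to mean-centred vectors, which is why the two metrics share one code path here.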


Odour selection 

Three distinct odour sets (‘global’, ‘clustered’ and ‘tiled’) were identified, 
each consisting of 22 odorants. The global odour set contained 
structurally diverse molecules that span the entire odour space. The 
clustered odour set consisted of 6 groups of 3-4 molecules, such that 
all the odours within each group share a chemical functional group (as 
well as other common chemical features); these groups were designed 
such that the odours that belong to each group were maximally sepa- 
rated from the odours belonging to all other groups. The tiled odour set 
included closely related aliphatic molecules that systematically varied 
along two dimensions; the first was the number of carbon atoms in the 
chain, and the second was the particular functional group attached 
to the carbon chain (that is, aldehydes, esters, ketones and acids, 
all of which are related to each other). Odour selection for the first 
two odour sets was performed with stochastic optimization (see 
‘Simulated annealing’) to prevent human-induced biases in odour 
set design. The cost function for the global odour set was designed 
to maximize separation between all 22 odours (by maximizing the 
minimum pairwise distance among selected odours). For the clustered 
odour set, within-group similarity and between-group dissimilarity 
were maximized by using the silhouette coefficient as the cost function. 
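
To make the global-set objective concrete, the sketch below scores a candidate odour set by its minimum pairwise distance and searches for a high-scoring set with a toy simulated-annealing loop. The swap move, linear cooling schedule, temperature and step count are illustrative assumptions and are not taken from the study's ‘Simulated annealing’ procedure; both function names are hypothetical.

```python
import numpy as np

def min_pairwise_distance(dist, idx):
    """Smallest pairwise distance among the selected odours (to be maximized)."""
    sub = dist[np.ix_(idx, idx)]
    return sub[np.triu_indices(len(idx), k=1)].min()

def select_dispersed(dist, k, n_steps=5000, t0=0.1, seed=0):
    """Toy annealing over k-odour subsets (requires k < number of odours)."""
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    current = list(rng.choice(n, size=k, replace=False))
    cur_score = min_pairwise_distance(dist, current)
    best, best_score = list(current), cur_score
    for step in range(n_steps):
        temp = t0 * (1.0 - step / n_steps) + 1e-12   # assumed linear cooling
        candidate = list(current)
        outside = np.setdiff1d(np.arange(n), current)
        candidate[rng.integers(k)] = rng.choice(outside)  # swap one odour
        score = min_pairwise_distance(dist, candidate)
        # Accept improvements; accept worse sets with Boltzmann probability.
        if score >= cur_score or rng.random() < np.exp((score - cur_score) / temp):
            current, cur_score = candidate, score
            if score > best_score:
                best, best_score = list(candidate), score
    return sorted(int(i) for i in best), float(best_score)

# Ten hypothetical odours placed on a line; distances are absolute differences.
x = np.arange(10, dtype=float)
dist = np.abs(x[:, None] - x[None, :])
print(select_dispersed(dist, k=3))
```

For the clustered set, the same loop could in principle be reused with the silhouette coefficient as the score instead of the minimum pairwise distance.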


Odour delivery 
A 23-valve olfactometer that can deliver up to 22 odorants was used 
to present odours (Island Motion). The 23rd valve was used to deliver 
a blank stimulus (no odour) between odour presentations. Custom 
Arduino software was used to control valve opening and closing, 
thereby enabling switching between odour vials and the blank vial. 
This software also controlled the output of two mass flow controllers 
(MFC). The first MFC delivered a constant carrier flow at 0.8 l min⁻¹ 
of purified air into a common channel; the second MFC supplied a 
constant flow at 0.2 l min⁻¹ of clean air that was injected into an odour 
vial (see below) and then merged with the carrier flow 1 inch (2.54 cm) 
in front of the mouse’s nose. A larger exhaust fan drew air from the 
imaging cage that enclosed the rig to prevent cross-contamination. 
Monomolecular odours were diluted in di-propylene glycol (DPG) 
according to individual vapour pressures obtained from www.thegoodscentscompany.com, 
to give a nominal concentration of 500 ppm. 
This vapour-phase concentration was further diluted 1:5 by the carrier 
airflow to yield 100 ppm at the exit port. Odour presentations lasted 
for two seconds and were interleaved by 30 s of blank (DPG) delivery. 
The order of presentation of odours was pseudo-randomized for each 
experiment, such that on any given trial, odours were presented once 
in no predictable order. Each odour was presented 7–10 times in each 
experiment. 
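
The 1:5 dilution follows from the two flow rates: an odour stream of 0.2 l min⁻¹ merged into a 0.8 l min⁻¹ carrier makes up one fifth of the total flow. A small worked example, assuming ideal mixing of the two streams (the function name is hypothetical):

```python
def exit_concentration_ppm(vial_ppm, odour_flow_l_min, carrier_flow_l_min):
    """Vapour-phase concentration after the odour stream joins the carrier."""
    total_flow = odour_flow_l_min + carrier_flow_l_min
    return vial_ppm * odour_flow_l_min / total_flow

# 500 ppm nominal at the vial, 0.2 l/min odour flow, 0.8 l/min carrier flow.
print(exit_concentration_ppm(500, 0.2, 0.8))  # about 100 ppm
```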


Two-photon calcium imaging 

High-speed volumetric imaging was performed using a 16-kHz res- 
onant galvo-regular galvo pair (Cambridge Technologies) housed 
in a custom-designed microscopy rig equipped with 2-inch optics. 
Acquisition was performed with a large working distance Nikon 16x 
objective (N16XLWD-PF, 0.8 NA, 3 mm WD) mounted on a high-speed 
piezo actuator (nPoint 400). A Chameleon laser (Coherent) tuned to 
930 nm delivered 50-120 mW of excitation power at the front end of 
the objective. Emitted fluorescence was detected using Hamamatsu 
H10770PA-40 PMTs. ScanImage 5 was used for hardware control and 
data acquisition. 

For imaging neuronal cell bodies, acquisition volumes spanned 
210 µm in the Z axis across PCx L2 and L3. Volumes were split into 6 
optical slices each spanning 35 µm of cortex. Volumes were positioned 
such that two slices resided in L2 and four slices resided in L3. This 
allowed us to monitor similarly sized populations of neurons in L2 and 
L3 given the approximately threefold lower cell density of L3 in posterior 
PCx. We typically discarded a single optical slice that spanned the 
boundary between layers to avoid any cross-contamination between 
layers. For axonal imaging, a single plane was acquired in PCx L1a at 
60 Hz and subsequently downsampled by averaging to match the neural 
acquisition rate. 

Note that our imaging fields are in the most anterior portion of 
posterior PCx. The degree of associational connectivity is known to sys- 
tematically vary across the anterior-posterior axis of the PCx, with the 
least associational and most feed-forward connectivity anteriorly. 
We chose to image in the ‘middle’ of the PCx both because of anatomical 
constraints in our imaging field in the two-photon configuration, and to 
ensure the representations we probed would include both feed-forward 
and associational connectivity. We would expect that if we imaged 
more anteriorly, we would observe representations that were progres- 
sively more ‘bulb’-like (given the relative predominance of inputs), and 
conversely that posterior PCx would deviate more strongly from the 
bulb (given the relative predominance of associational connectivity); 
because of surgical constraints, addressing this possibility will require 
the future development of alternative means of accessing anterior and 
posterior PCx both pre- and post-synaptically. 


Data inclusion criteria 

For experiments involving the global, clustered and tiled odour sets in 
odour-naive mice, data were analysed from three mice per odour set. 
For the passive odour exposure experiment, data were analysed from 
three mice. For bouton imaging, data were analysed from six mice. All 
mice that satisfied the following pre-determined criteria were included 
in the study: imaging volumes spanning both piriform cortical L2 and 
L3 (L1a for boutons) could be imaged continuously for the duration of 
the experiment; in each cortical layer, at least 150 GCaMP6s-labelled 
neurons could be identified (500 axonal boutons for axonal imaging); 
odour-evoked activity persisted over the course of the entire imaging 
session; and field-of-view drift and motion artefacts could be fully 
corrected with post hoc image registration. Given the nature of this 
population imaging study, study sample size was not pre-determined, 
the experiments were not randomized, and the investigators were not 
blinded to study conditions. 


Signal extraction 

Detection of regions of interest (ROIs), segmentation, and extraction 
of fluorescence signal was performed using the Suite2p software. 
This package implements image registration, neuropil fluorescence 
correction and fluorescence source detection from spatially overlap- 
ping ROIs. To accommodate differences in ROI size between axonal 
boutons and somata, the expected ROI size parameter was set to 5 µm 
for axonal boutons and 12 µm for somata. 


auROC-based detection of odour responses 

Analysis was only performed on neurons that responded, in a statistically 
significant manner, to at least one odorant. To identify such neurons, we 
computed the area-under-the-receiver-operator-curve (auROC) statistic 
for each cell-odour pair. The auROC metric represents the probability 
that a neuron’s response, chosen at random from all presentations of 
the same odour, will be ranked higher than a randomly chosen sham 
response obtained using baseline activity. A value of 0.5 indicates no 
difference between the activity of a neuron during baseline and odour 
presentation. A value of 1 indicates a perfectly distinguishable excitatory 
response, whereas a value of 0 indicates a perfectly distinguishable 
suppressed response. For a single neuron and all presentations of a 
single odorant, the classifier was provided with the mean fluorescence 
obtained from 2-s time windows immediately flanking odour onset. A 
null distribution of auROC values for each cell-odour pair was con- 
structed by randomly permuting the identity of the odour and baseline 
periods on each presentation. This procedure was repeated 1,000 times. 
The actual auROC value was deemed significant if it resided outside the 
1st–99th percentile range of the null distribution. Neurons that did not display 
a significant response to any odours, according to auROC analysis, 
were excluded from all subsequent analysis. Of those neurons imaged, 
the fraction of retained neurons (and the absolute number of neurons 
in each data set) were: global L2 = 854 neurons, 82%; global L3 = 616 
neurons, 89%; clustered L2 = 867 neurons, 87%; clustered L3 = 488 neurons, 
85%; tiled L2 = 427 neurons, 59%; tiled L3 = 334 neurons, 52%; TeLC 
tiled L2 = 435 neurons, 51%; TeLC tiled L3 = 590 neurons, 68%; boutons 
tiled = 3,160 boutons, 68%. Note that the number of neurons deemed 
responsive by auROC analysis is proportional to the extent to which 
each odour set captured chemical diversity, with the greatest number 
of responsive neurons observed in the global odour set, and the fewest 
observed in the tiled odour set. This distribution of responsive neurons 
(between 51% and 89%, depending on the chemical diversity in each 
odour set) is consistent with previous work characterizing response 
densities and tuning breadths in PCx. 
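
The auROC test described above can be sketched as follows for one cell-odour pair. The inputs are assumed to be per-presentation mean fluorescence values from the 2-s windows before (baseline) and after (odour) onset; the null is built by swapping the two windows on a random subset of presentations. Function names, the seed and the tie handling are choices made for this example.

```python
import numpy as np

def auroc(odour, baseline):
    """P(random odour response > random baseline response); ties count 0.5."""
    odour = np.asarray(odour, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    greater = (odour[:, None] > baseline[None, :]).sum()
    ties = (odour[:, None] == baseline[None, :]).sum()
    return (greater + 0.5 * ties) / (odour.size * baseline.size)

def auroc_permutation_test(odour, baseline, n_perm=1000, seed=0):
    """Return (auROC, significant) using a label-swap null distribution."""
    rng = np.random.default_rng(seed)
    odour = np.asarray(odour, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    observed = auroc(odour, baseline)
    null = np.empty(n_perm)
    for i in range(n_perm):
        # Swap the odour and baseline windows on a random subset of trials.
        swap = rng.random(odour.size) < 0.5
        null[i] = auroc(np.where(swap, baseline, odour),
                        np.where(swap, odour, baseline))
    lo, hi = np.percentile(null, [1, 99])
    return observed, bool(observed < lo or observed > hi)

# Ten hypothetical presentations with a clear excitatory response.
odour = np.array([1.0, 1.2, 0.9, 1.1, 1.3, 1.0, 1.2, 1.1, 0.95, 1.25])
baseline = np.array([0.0, 0.1, -0.1, 0.05, 0.0, 0.1, -0.05, 0.02, 0.03, -0.02])
print(auroc_permutation_test(odour, baseline))
```

With only 7–10 presentations the null distribution is coarse, so the 1st and 99th percentiles sit close to the extreme attainable values.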


Gaussian mixture model for response type clustering 

Clustering of cell-odour response types was performed for visualiza- 
tion purposes only. Trial-averaged response time courses spanning the 
odour presentation period were dimensionally reduced by principal 
component analysis (PCA) to capture 90% of the variance in the data and 
served as the input to a Gaussian-mixture model; the optimal number 
of clusters was assessed using the Bayesian information criterion. 


Lifetime and population sparseness 

Lifetime sparseness is a metric reflecting the tuning breadths of indi- 
vidual neurons, with neurons specifically tuned to small numbers of 
stimuli exhibiting a lifetime sparseness of close to 1; population sparse- 
ness is a metric reflecting the density of responses among a population 
of neurons to a set of stimuli, with less dense responses (that is, fewer 
neurons or boutons responding to a stimulus set) exhibiting a population 
sparseness of close to 1. To determine the odour-selectivity of a neuron, 
the lifetime sparseness metric was computed as previously described: 

$$
\text{lifetime sparseness} = \frac{1 - \left(\sum_{j=1}^{N} r_j / N\right)^{2} \Big/ \left(\sum_{j=1}^{N} r_j^{2} / N\right)}{1 - 1/N}
$$

where r_j is the positive odour-evoked change in fluorescence to an 
odour j relative to baseline and averaged over multiple odour presentations, 
and N is the number of odours (22 in all odour sets). Inhibitory 
responses were zeroed (for this analysis only). Lifetime sparseness 
reflects the kurtosis of the tuning profile of a neuron and ranges from 
0 to 1. Highly peaked, narrow tuning profiles yield values close to 1 and 
represent neurons that respond strongly and selectively to few odours. 
Values close to 0 indicate equal responsiveness to a large fraction of 
the odour set. Population sparseness for each odour was calculated 
using the same formula used for lifetime sparseness, but in this case, 
j indexes a neuron instead of an odour. 
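
A minimal sketch of the sparseness calculation, assuming the widely used normalized form (1 − (Σ r_j / N)² / (Σ r_j² / N)) / (1 − 1/N), which ranges from 0 to 1 as stated above. The function name and the handling of an all-zero response vector are choices made for this example.

```python
import numpy as np

def sparseness(responses):
    """Sparseness of a response vector; negative entries are zeroed first."""
    r = np.clip(np.asarray(responses, dtype=float), 0.0, None)
    n = r.size
    mean_sq = (r ** 2).mean()
    if mean_sq == 0:
        return float("nan")      # no positive response: the metric is undefined
    return (1.0 - r.mean() ** 2 / mean_sq) / (1.0 - 1.0 / n)

# A neuron responding to 1 of 22 odours versus one responding equally to all.
selective = np.zeros(22)
selective[3] = 1.0
broad = np.ones(22)
print(round(sparseness(selective), 6), round(sparseness(broad), 6))
```

Passing one neuron's responses across odours gives lifetime sparseness; passing one odour's responses across neurons gives population sparseness.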


Signal and ensemble correlations 
The extent to which any two neurons have similar odour preferences can 
be assessed by computing the Pearson’s product moment correlation 
between their trial-averaged odour response profiles (tuning curves). 
This is typically referred to as a ‘signal’ correlation. The tuning curve of 
each neuron was represented as a vector containing N elements, where 
N is the number of odours in the stimulus set. Each entry in this vector 
corresponds to the odour-evoked change in fluorescence relative to 
baseline and averaged over all presentations. For each neuron, responses 
across odours were z-scored. Populations were defined as all neurons 
that responded to at least one odour according to the auROC analysis. 
We also wished to compute the similarity in odour responses exhib- 
ited by different odour-evoked ensembles of PCx neurons or boutons. 
We refer to this here as ‘ensemble’ correlation. In these analyses, we 
computed the Pearson’s correlation between the population responses 
to every pair of odours in our panel. The population response of each
odour was represented as a vector containing N elements, where N is the
number of neurons/boutons in the population. Each entry in this vector
represents the trial-averaged response of each single neuron/bouton to 
the corresponding odour. For each neuron/bouton, responses across 
odours were z-scored. Populations were defined as all neurons/boutons 
that responded to at least one odour according to the auROC analysis.
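The two correlation types differ only in which axis of the response matrix is correlated. The sketch below uses a hypothetical random response matrix (40 neurons, 22 odours) in place of real data, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical trial-averaged responses: rows = neurons, columns = odours
responses = rng.normal(size=(40, 22))

# z-score each neuron's responses across odours
z = (responses - responses.mean(axis=1, keepdims=True)) / responses.std(axis=1, keepdims=True)

# np.corrcoef treats each row as one variable
signal_corr = np.corrcoef(z)        # neuron x neuron: similarity of tuning curves
ensemble_corr = np.corrcoef(z.T)    # odour x odour: similarity of population responses
```

Because Pearson's r is invariant to shifting and scaling each variable, the per-neuron z-scoring leaves signal correlations unchanged but does affect ensemble correlations.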


Chemical and activity distance 

Pairwise odour distances (1 − Pearson’s r) in neural activity space were
computed between odour vector pairs where each matched vector
entry corresponded to a single neuron’s trial-averaged response to the
corresponding odour. This procedure was applied to neural popula- 
tions from individual mice or to pseudo-populations of neurons built 
by pooling all responsive neurons from all mice for each layer and 
odour set. In chemical space, correlation distances between odour 
pairs were computed identically, except each vector entry (matched 
across odours) represented the odour-specific value assigned by a 
physiochemical descriptor. For presentation purposes, distance matri- 
ces were sorted using hierarchical clustering. For the global odour set, 
all odours were sorted collectively. For the clustered odour set, odours 
were sorted within each functional class first followed by sorting on 
functional classes. For the tiled odour set, functional group classes 
were sorted, but odours within each class were ordered according to 
increasing chain length. Row and column ordering of all activity and 
chemistry distance matrices is preserved across figures. Note that 
for all correlation analyses, both inhibitory and excitatory responses 
were included. 


UMAP embedding 

For visualizing odour relationships in neural data, population 
responses were embedded in two dimensions using UMAP*. Selec- 
tion of optimal embedding settings was accomplished by minimizing 
the mean-squared error between correlation distance matrices built 
from data projected on the UMAP dimensions and those corresponding
to the input data. Simulated cortical responses from the feed-forward
models were processed in a similar manner. Because UMAP imposes an
arbitrary rotation on projected data, each embedding was aligned to a
reference using the orthogonal Procrustes transformation. For embed- 
dings of pseudopopulation data for the tiled odour set across boutons 
as well as neural and simulated cortical data, bouton data served as the 
reference. For aligning embeddings obtained from individual mice, 
the orthogonal Procrustes transform (rotation and reflection only) 
was performed iteratively across mice in a pairwise manner”. Note 
that these embeddings are meant to visualize odour relationships 
(and are complemented by quantitative metrics); pairwise relation- 
ships cannot be changed by any of the rotations used herein to align 
embeddings to each other. 
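The alignment step can be sketched with a singular value decomposition, which gives the closed-form solution of the orthogonal Procrustes problem. This is an illustrative sketch on synthetic 2-D points, not the study's code; `procrustes_align` and `pairwise_dist` are hypothetical helper names, and centring both embeddings before alignment is an assumption.

```python
import numpy as np

def procrustes_align(embedding, reference):
    """Rotate/reflect `embedding` (odours x 2) to best match `reference`."""
    a = embedding - embedding.mean(axis=0)
    b = reference - reference.mean(axis=0)
    u, _, vt = np.linalg.svd(a.T @ b)
    rotation = u @ vt                  # orthogonal matrix minimizing ||a @ R - b||_F
    return a @ rotation

def pairwise_dist(x):
    """Euclidean distance between every pair of rows."""
    return np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)

rng = np.random.default_rng(1)
reference = rng.normal(size=(22, 2))   # stand-in for a reference embedding
theta = np.deg2rad(70.0)
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
rotated = reference @ rot              # same geometry, arbitrary rotation
aligned = procrustes_align(rotated, reference)
```

Because the transform is restricted to rotation and reflection, `pairwise_dist(aligned)` equals `pairwise_dist(rotated)`, which is the property the text relies on.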


Distance-based nearest neighbour decoding and classification 
To test whether PCx odour relationships are invariant across individ- 
uals, we asked whether we could identify any given odour from one 
mouse based upon the pairwise odour relationships observed in other 
mice. To decode odour identity based on odour relationships, nearest 
neighbour classifiers were trained on odour distances from two mice 
and tested on odour distances obtained from a held-out mouse, such 
that each mouse was tested once. For each such classifier, 100 bootstrap 
iterations were performed. For each iteration, correlation distance 
matrices were constructed using a randomly sampled neural ensemble
containing 50 neurons. In each condition, each distance matrix rep- 
resented all pairwise correlations between trial-averaged population 
responses (see ‘Chemical and activity distance’), such that any given 
training or testing odour was represented as a vector containing 21 
pairwise distances. For a single run of the classifier, reported accuracy
represents the fraction of odours that were correctly identified. 
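One way such a leave-one-mouse-out decoder could be set up is sketched below on synthetic data. Several details are assumptions made for illustration and are not taken from the paper: the rule for matching distance profiles (self-distances are masked and profiles are compared by root-mean-square difference), the synthetic mice (a shared latent odour geometry read out through mouse-specific random weights), and the reduced number of bootstrap iterations.

```python
import numpy as np

def correlation_distance(pop):
    """pop: neurons x odours -> odours x odours matrix of 1 - Pearson's r."""
    return 1.0 - np.corrcoef(pop.T)

def nn_decode(train_dists, test_dist):
    """Give each test odour the label of the closest training distance profile."""
    n = test_dist.shape[0]
    predicted = np.empty(n, dtype=int)
    for i in range(n):
        best, best_label = np.inf, -1
        for train in train_dists:
            for k in range(n):
                keep = np.ones(n, dtype=bool)
                keep[[i, k]] = False      # mask self-distances (assumed matching rule)
                diff = test_dist[i, keep] - train[k, keep]
                d = np.sqrt(np.mean(diff ** 2))
                if d < best:
                    best, best_label = d, k
        predicted[i] = best_label
    return predicted

# Synthetic mice: shared odour geometry, mouse-specific random read-out weights
rng = np.random.default_rng(0)
n_odours, n_latent, n_neurons = 22, 5, 200
latent = rng.normal(size=(n_latent, n_odours))
mice = [rng.normal(size=(n_neurons, n_latent)) @ latent for _ in range(3)]

def sample_distance(mouse, size=50):
    """Distance matrix from a random 50-neuron ensemble."""
    idx = rng.choice(mouse.shape[0], size=size, replace=False)
    return correlation_distance(mouse[idx])

accuracies = []
for held_out in range(3):                 # each mouse is tested once
    for _ in range(10):                   # 100 bootstrap iterations in the text
        train = [sample_distance(m) for j, m in enumerate(mice) if j != held_out]
        test = sample_distance(mice[held_out])
        accuracies.append(np.mean(nn_decode(train, test) == np.arange(n_odours)))
mean_accuracy = float(np.mean(accuracies))
```

Chance level for 22 odours is 1/22, so `mean_accuracy` can be read against that baseline.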


Distance covariance analysis 

Distance covariance analysis (DCA) belongs to a set of statistical methods
that seek to identify shared dimensions of variability between
two different data sets. DCA—an extension of canonical correlation 
analysis—was developed for identifying related dimensions of activity 
across two or more populations of neurons®. To measure the similar- 
ity of cortical odour relationships between individual mice, DCA was 
performed on response data (neurons by odour-trial) from all individual 
mice exposed to the same set of odours (code supplied with the refer- 
ence). The output consisted of a set of orthogonal dimensions (one 
set per mouse) and associated DCA statistics ranked from highest to 
lowest contribution to common activity between individuals. Each 
dimension was evaluated for significance by permutation testing. 
The null distribution of DCA statistics was constructed by shuffling 
the sequence of odour responses across all neurons in each mouse to 
destroy between-mouse but not within-mouse relationships. A dimension was
deemed significant if its associated statistic was higher than the 95th
percentile of the null distribution built from 100 permutations. Three
to six dimensions were typically retained. Because DCA is not determin- 
istic, this procedure was subjected to 100 independent restarts. The 
reported results correspond to the best modelling run. The fraction of 
an individual’s neural variance that could be explained by the shared 
embedding was determined by calculating the total accounted variance 
after regressing each neuron’s activity on the set of DCA dimensions. 


Silhouette coefficient 

The degree of clustering in odour correlation distance matrices was 
evaluated using k-means clustering and the silhouette coefficient. Cor- 
relation distances were computed between trial-averaged population 
responses to all 22 odours in the tiled odour set. Correlation distance 
matrices containing all pairwise odour distances were projected onto 21 
principal components and subjected to k-means clustering. For this set 
of labelled data, the silhouette coefficient assigns a single value ranging 
from −1 (overlapping diffuse clusters) to 1 (compact, well-separated
clusters) that represents the average silhouette score computed on an
odour-by-odour basis: for an odour i, its score, S_i, is defined as
(b_i − w_i)/max(w_i, b_i), where w_i is the average Euclidean distance between
odour i and all other odours with the same class label, and b_i is the average
distance between odour i and all odours in the next nearest class.
Qualitatively similar results between experimental conditions were obtained
by running k-means and cluster evaluation directly on full population 
response data or by obtaining k-means labels from full population data 
and computing the silhouette coefficient using these labels on PCA 
embeddings of correlation distance matrices. 
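The silhouette definition above can be written out directly. The sketch below is illustrative: it takes cluster labels as given (in the analysis they come from k-means), the function name `silhouette` and the four toy points are hypothetical, and assigning a score of 0 to singleton clusters is an assumption.

```python
import numpy as np

def silhouette(points, labels):
    """Mean of (b_i - w_i) / max(w_i, b_i) over all points."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    classes = set(labels.tolist())
    scores = np.zeros(len(points))
    for i, lab in enumerate(labels):
        same = labels == lab
        same[i] = False                        # exclude the point itself
        if not same.any():                     # singleton cluster: score stays 0
            continue
        w = dist[i, same].mean()               # mean distance within own class
        b = min(dist[i, labels == other].mean()
                for other in classes if other != lab)   # nearest other class
        scores[i] = (b - w) / max(w, b)
    return scores.mean()

toy = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good = silhouette(toy, [0, 0, 1, 1])   # labels match the two tight groups
bad = silhouette(toy, [0, 1, 0, 1])    # labels cut across the groups
```

Labels that respect the two groups give a score near 1, whereas labels that cut across them give a negative score.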


Effective dimensionality 

Effective dimensionality (ED) of a population of neurons, a quantity 
reflecting the number of principal components required to capture the 
odour-evoked neural variance, was defined as previously described”. 
In brief, for each experimental condition and for model cortical activity,
ED was quantified from trial-averaged population responses (neural
data only) after mean-centring units across odours. To compare ED across
ensembles of similar size, values were calculated as averages across
randomly chosen 300-unit ensembles (100 bootstraps).


Hierarchical clustering 

Dendrograms depicting the reconfiguration of odour relationships 
across boutons, PCx, TeLC, and the feed-forward random connectivity 
model PCx were constructed directly from correlation distance matri- 
ces associated with each experiment. First, each correlation matrix 
was projected, using PCA, onto K dimensions, where K is the number 
of dimensions required to explain 95% of the variance in correlation 
distances. The resulting embedding expresses the contributions of 
each odour to the prominent similarity or dissimilarity modes in the
original correlation distance matrix. Dendrograms were built by hierarchically
clustering these data using Euclidean distance and Ward’s
linkage. Clustering similarity between dendrogram pairs was assessed
using the cophenetic correlation coefficient after topologically aligning
each pair.
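The pipeline described here (PCA to 95% variance, Ward clustering, cophenetic comparison) can be sketched with NumPy and SciPy. This is an assumption-laden illustration on synthetic data: `pca_95` and `dendrogram_similarity` are hypothetical helpers, and the explicit topological alignment step is omitted because cophenetic distances are indexed by odour pair rather than by leaf order.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet

def pca_95(matrix, var_explained=0.95):
    """Project rows onto the fewest principal components reaching the variance target."""
    centred = matrix - matrix.mean(axis=0)
    u, s, _ = np.linalg.svd(centred, full_matrices=False)
    frac = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(frac, var_explained) + 1)
    return u[:, :k] * s[:k]

def dendrogram_similarity(dist_a, dist_b):
    """Pearson's r between the cophenetic distances of two Ward dendrograms."""
    coph_a = cophenet(linkage(pca_95(dist_a), method="ward"))
    coph_b = cophenet(linkage(pca_95(dist_b), method="ward"))
    return np.corrcoef(coph_a, coph_b)[0, 1]

rng = np.random.default_rng(0)
pop_a = rng.normal(size=(100, 22))                     # hypothetical neurons x odours
pop_b = pop_a + 0.1 * rng.normal(size=pop_a.shape)     # slightly perturbed copy
dist_a = 1.0 - np.corrcoef(pop_a.T)                    # odour x odour correlation distance
dist_b = 1.0 - np.corrcoef(pop_b.T)
similarity = dendrogram_similarity(dist_a, dist_b)
```

Comparing a matrix with itself returns 1, which is a convenient sanity check before comparing experimental conditions.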


Feed-forward connectivity model 

To assess whether the observed cortical odour responses are expected 
under a random feed-forward model, we simulated the OB-PCx network
using a previously established model”. In our implementation of this 
model (which hews as closely as possible to the published model), the 
OB and PCx layers contained 1,000 and 100,000 units, respectively,
feed-forward connections were assigned randomly and could be either
excitatory or inhibitory, and each cortical neuron was ‘innervated’
by a random 20% of excitatory OB units and a random 40% of inhibitory
units. Excitatory and inhibitory connection weights were set to 1
and −0.5, respectively, providing each PCx unit with balanced
excitatory/inhibitory innervation. Odour-evoked activity in the OB layer
was simulated from a log-normal multivariate distribution defined by
population-mean response amplitudes and covariance obtained from
bouton activity. Model PCx neurons linearly summed their inputs and
were zero-rectified. The odour-average response fraction of model PCx
units was adjusted to 8% to match the fraction of excitatory responses 
observed in PCx (detected by auROC analysis) on average across all 
odours and subjects in our experimental data. 
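A scaled-down sketch of this kind of random feed-forward network is shown below. It is not the published model or its code, and it makes several labelled assumptions: connections are drawn as independent Bernoulli masks, OB activity is an uncorrelated log-normal placeholder rather than a fit to bouton data, the PCx layer is reduced to 2,000 units for speed, and the 8% response fraction is imposed with a single global threshold.

```python
import numpy as np

rng = np.random.default_rng(0)
n_ob, n_pcx, n_odours = 1000, 2000, 22    # PCx scaled down from 100,000 units

# Random connectivity: Bernoulli draws approximate "a random 20% / 40%" of OB units
excit = rng.random((n_pcx, n_ob)) < 0.20
inhib = rng.random((n_pcx, n_ob)) < 0.40
weights = 1.0 * excit - 0.5 * inhib       # 0.2 * 1 balances 0.4 * 0.5 on average

# Placeholder OB activity (assumption: independent log-normal responses)
ob = rng.lognormal(mean=0.0, sigma=1.0, size=(n_ob, n_odours))

drive = weights @ ob                            # linear summation of inputs
threshold = np.quantile(drive, 1.0 - 0.08)      # assumed global threshold
pcx = np.maximum(drive - threshold, 0.0)        # zero-rectification

response_fraction = np.mean(pcx > 0)            # about 0.08 by construction
```

With weights of 1 and −0.5 on 20% and 40% of inputs, the expected net weight per connection is zero, which is the sense in which excitation and inhibition are balanced.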


Decoding analysis 

Linear SVM classifiers were trained to predict either odour identity or 
odour class (based on chain-length) in the tiled odour set on the basis 
of odour-evoked population activity. 

For discrimination of odour identity (Fig. 4a), all neurons or boutons 
that responded to at least one odour in the tiled odour set (according 
to auROC analysis) were pooled to build three pseudo-populations of 
neurons or boutons (boutons, L2, L3). Z-scored responses of a population
of up to n randomly selected neurons or boutons (the maximum
common number of neurons recorded across the layers) were then
considered, given t presentations of j odours, as a matrix X with n rows
(neurons/boutons) and t × j columns (trials/instances × odours/classes).
Each column of this matrix was thus a vector of n responses, one for
each neuron/bouton in response to a given odour in each trial. Each
decoding session started with a split of the matrix X into a training and
test set: the training set included 0.9 × t randomly chosen trials for each
class and the test set comprised the 0.1 × t held-out trials for each class
(that is, a standard 9:1 training:testing split). This procedure, which
is instantiated as part of the standard LIBSVM library
(http://www.csie.ntu.edu.tw/~cjlin/libsvm/), allows us to use a binary classification
algorithm (such as an SVM) to compare multiple classes. In any given 
experiment, the train-test procedure was iterated 100 times (with train- 
ing and test data randomly chosen on each iteration) to cross-validate 
classifier performance. For differently sized subpopulations of neurons 
or boutons, a randomly selected subset of neurons or boutons was 
used for each cross-validation cycle, and at the end of this procedure 
the outcomes of each individual iteration were averaged to generate a
measure of classification accuracy across all restarts; this is the overall
measure that is reported in the main text (Fig. 4a). The hyperplanes for
each classifier were determined using the LIBSVM library with a linear
kernel, the C-SVC algorithm, and cost c. Cost c is the only free parameter
for a linear kernel, and it was found by a grid search on an initial dataset
including 50 randomly chosen neurons/boutons from each dataset in 
order to maximize the accuracy of the decoder classification. 
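The repeated 9:1 train-test procedure can be sketched as follows. This illustration uses scikit-learn's `SVC` (which is built on LIBSVM) instead of calling LIBSVM directly, synthetic responses in place of recorded data, and a fixed cost `C=1.0` rather than a grid-searched value; all of these are assumptions. Note that scikit-learn expects samples in rows, so the matrix is the transpose of X as defined in the text.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.default_rng(0)
n_neurons, n_odours, n_trials = 50, 5, 10      # toy sizes

# Synthetic z-scored responses: one mean pattern per odour plus trial noise
patterns = rng.normal(size=(n_odours, n_neurons))
X = np.repeat(patterns, n_trials, axis=0)
X = X + 0.5 * rng.normal(size=X.shape)         # rows = trials, columns = neurons
y = np.repeat(np.arange(n_odours), n_trials)   # odour label of each trial

# 100 random splits, each holding out 10% of the trials of every odour
splitter = StratifiedShuffleSplit(n_splits=100, test_size=0.1, random_state=0)
accuracies = []
for train_idx, test_idx in splitter.split(X, y):
    clf = SVC(kernel="linear", C=1.0)          # C-SVC with a linear kernel
    clf.fit(X[train_idx], y[train_idx])
    accuracies.append(clf.score(X[test_idx], y[test_idx]))

mean_accuracy = float(np.mean(accuracies))     # averaged over all iterations
```

Stratified splitting keeps the 9:1 ratio within every odour class, which matches the per-class description of the split.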

For Fig. 4b, classifiers were trained on 21 out of 22 odours in the tiled
odour set, with all trials associated with any training odour assigned to
1 out of 21 classes. SVM class predictions for each held-out odour were
converted to confusion probabilities (the probability that any given
held-out odour is associated with any other of 21 odours) using the
Python scikit-learn library” implementation of Platt scaling. Class
probabilities for each tested odour were rank-ordered before averag- 
ing across all odours. 

For Fig. 4c, classifiers were trained to predict the class (either 
short-chain or long-chain) of a held-out odour after training on a single
short-chain–long-chain odour pair. All odour trials were presented on
each train-test iteration and accuracy was determined as the fraction 
of correctly labelled held-out trials. For each randomly chosen sub- 
population of neurons, on each of 100 restarts, training and testing 
was performed on all possible short-chain–long-chain odour pairs and
associated held-out odours. Cross-validated generalization accuracy 
corresponds to the average performance across all restarts and folds of 
the data. The short-chain class contained aldehydes: propanal, butanal 
and pentanal; ketones: propanone, butanone and pentanone; esters: 
ethyl and butyl acetate. The long-chain class contained aldehydes: hep- 
tanal and octanal; ketones: hexanone, heptanone and octanone; esters: 
pentyl and hexyl acetates. The acid block is excluded for this analysis. 


Lasso optimization 

For finding small, optimal combinations of Dragon descriptors that 
predict neural odour relationships, an L1-regularized optimization 
routine was designed to maximize the correlation between matched 
odour-pair distances across chemical and neural spaces. During each 
step of optimization (‘L-BFGS-B’ gradient descent), descriptor weights 
(0 bounded) were modified and chemical distances were recomputed. 
The optimization objective sought to minimize the residual sum of 
squares between the modified descriptor distances and corresponding 
neural odour distances. The Lasso component set the sparseness of 
the final solution and was selected for each model by cross-validation. 
Models were trained and validated on individual odour sets containing 
22 molecules with fivefold cross-validation (random splits) such that 
on any split of the data, 17 × (17 − 1)/2 odour pairs made up the training
set and 5 × (22 − 5) odour pairs comprised the validation set. For assessing
generalization to other odour sets, models were retrained with
all 22 odours before testing. Because the global and clustered odour 
sets share odours, overlapping odours were removed from training 
when cross-applying to the held-out odour set. For within-odour set 
cross-validation and cross-application to the tiled odour set all odours 
were included. For Extended Data Fig. 5c, we sought to determine the 
relative contribution of the full set of descriptors belonging to the 
‘molecular properties’ block of the Dragon database to bouton/cortical 
relationships; for this analysis, optimization was performed without 
imposing L1-regularization. 
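The pair counts quoted for a single cross-validation split can be checked by enumeration. The sketch below assumes one split with 17 training and 5 held-out odours, with validation pairs defined as every pairing of a held-out odour with a training odour; the variable names are illustrative.

```python
from itertools import combinations

n_odours = 22
odours = list(range(n_odours))
held_out = odours[-5:]       # 5 validation odours for this split
training = odours[:-5]       # remaining 17 training odours

# All pairs among training odours: 17 * (17 - 1) / 2
train_pairs = list(combinations(training, 2))

# Pairs linking each held-out odour to each training odour: 5 * (22 - 5)
valid_pairs = [(v, t) for v in held_out for t in training]
```

This gives 136 training pairs and 85 validation pairs; the 10 pairs formed between two held-out odours belong to neither set.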


Simulated annealing 
Simulated annealing was used for odour set design, as well as for predictive
modelling of neural odour relationships. Simulated annealing is
a well-validated Monte Carlo sampling variant designed for stochastic
optimization®. Simulated annealing optimization works by slowly 
decreasing a pre-specified cost function over the course of sampling, 
thereby enabling good initial coverage of solution space and progressive
convergence on a global optimum. All simulated annealing routines
were implemented using the open-source Python package simanneal 
available at https://pypi.python.org/pypi/simanneal. For odour set 
selection, the number of features was reduced, via PCA, such that the 
transformed odour space accounted for 95% of the original variance. 
Optimization for the global and clustered odour sets (see ‘Odour selec- 
tion’) was carried out using Euclidean distance in this reduced space. 
In addition, our findings using Lasso optimization were verified 
using simulated annealing; here, the simulated annealing objective 
was designed to identify small sets of Dragon physiochemical features 
describing a set of molecules such that molecular distances in chemical 
space were maximally correlated with corresponding odour distances 
in neural space. Simulated annealing yielded findings qualitatively similar
to those reported in the manuscript using Lasso optimization
(data not shown). By design, the chemical variation in the tiled odour
set, queried here, is focused on chain length and functional group; while 
the observation that bulb and cortex differentially encode information 
about these chemical features clearly demonstrates a representational 
transformation, it does not imply that chain length or functional group
per se are privileged in either of these brain regions relative to the mul- 
titude of chemical features not captured by the tiled odour set. 


Cross-habituation for assessing perceptual odour similarity 

C57 males (5-6 weeks old) obtained from Jackson laboratories were 
housed on a reverse light schedule for 48 h before beginning behavioural
experiments. Established procedures for assessing odour similarity were 
slightly modified**. Twenty-six pairwise comparisons were obtained, with 
twelve odorants serving as the first odorants in each comparison: butanal 
versus pentanal or propyl acetate; butanoic acid versus heptanoic acid 
or butanone; butanone versus butanoic acid or propyl acetate or buta- 
nal; heptanal versus octanal, or pentyl acetate; heptanone versus pentyl 
acetate or heptanal or hexyl acetate or octanal; hexyl acetate versus pentyl 
acetate or octanone or octanal or octanoic acid; octanal versus butanal 
or heptanal or octanone; octanoic acid versus octanal; octanone versus 
octanoic acid; pentyl acetate versus hexyl acetate; propanoic acid versus
butanoic acid; propyl acetate versus butanone or butanal or butanoic acid.

Short-chain pairwise odour comparisons, different functional 
groups (asterisks indicate the same pair was presented at different 
positions in the triplet): butanone versus butanal; butanone versus 
butanoic acid*; propyl acetate versus butanone*; propyl acetate versus
butanal*; propyl acetate versus butanoic acid. 

Long-chain pairwise odour comparisons, different functional groups 
(asterisks indicate the same pair was presented at different positions in 
the triplet): heptanal versus pentyl acetate; octanal versus octanone; 
heptanone versus pentyl acetate; heptanone versus heptanal; heptanone 
versus hexyl acetate; heptanone versus octanal; octanone versus octanoic 
acid; pentyl acetate versus hexyl acetate*; hexyl acetate versus octanone;
hexyl acetate versus octanal; hexyl acetate versus octanoic acid; octanoic 
acid versus octanal. 

Remaining pairs: heptanal versus octanal; octanal versus butanal; 
butanal versus pentanal; propionic acid versus butanoic acid; butanoic 
acid versus heptanoic acid.

Because three odours were presented to each mouse, two adjacent 
odour pairs were included in analysis from each mouse. Presentation 
order effects were considered by swapping the order of any given triplet 
in different experiments. As indicated above, for some comparisons the 
same pair was presented at different positions in the triplet. Investigation
time was scored manually, using video footage obtained during each 
experiment. Scoring was done blinded to experimental conditions, and 
with no knowledge of odour identity. Odour investigation was defined 
as periods of orienting to the odour source on the half of the cage con- 
taining the odour source as well as by stereotyped bouts of sniffing and 
associated head-bobbing. 

Perceptual similarity (behavioural distance) between two odours was 
defined as the difference in time spent investigating the first odour in a
pair during its last presentation and the investigation time associated 
with the first presentation of the subsequent odour (Extended Data 
Fig. 8). In Fig. 4d, linear regression was used to relate behavioural to 
neural distance. Because behavioural distance increases monotoni- 
cally but not necessarily linearly with neural distance, we also used 
Spearman’s ρ, which measures correlation based on ranks and is less
restrictive than linear regression. Three to twelve mice were used for 
each pairwise comparison (eight mice on average per triplet experi- 
ment). Each mouse was used for a single set of odour comparisons. 
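The reason for reporting a rank-based statistic alongside linear regression can be seen in a small example. The values below are hypothetical and chosen only to be monotonic and strongly nonlinear; `spearman_rho` is an illustrative helper that assumes no tied values.

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman's rho as Pearson's r between ranks (assumes no ties)."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return np.corrcoef(rank_x, rank_y)[0, 1]

neural = np.array([0.1, 0.3, 0.5, 0.7, 0.9])    # hypothetical neural distances
behaviour = np.exp(4 * neural)                  # monotonic but nonlinear relation

rho = spearman_rho(neural, behaviour)           # rank correlation
r = np.corrcoef(neural, behaviour)[0, 1]        # linear (Pearson) correlation
```

Here the rank correlation is perfect while Pearson's r is noticeably lower, because only the ordering of the values, not their spacing, enters the rank statistic.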


Passive odour exposure 

C57 males (5-6 weeks old) were housed on a reverse light schedule for
48 h before behavioural training. Group-housed mice were subjected to 
daily odour exposures for a period of two weeks. In each training session
(30 min; 3 times per day, 14 consecutive days) mice were simultaneously
presented with two short-chain aldehydes (propanal, butanal) and two
short-chain ketones (propanone, butanone) for 1 min separated by a 5-min
inter-stimulus-interval. Odour delivery was designed to closely approxi- 
mate odour presentation during cortical imaging. In brief, odours were 
delivered to the home cage using a custom-built olfactometer consisting
of an activated-carbon purification unit, master air flow controllers, and
a valve bank coupled to four odour vials and one blank vial. Flow rates for
carrier and odour lines were set to 0.8 and 0.2 l min⁻¹, respectively.
Monomolecular odorants were diluted in DPG according to individual vapour
pressures to yield a final concentration of 100 ppm at the output of the 
olfactometer. During odour delivery, odorants were mixed in air phase by 
simultaneous opening of all four valves. During the inter-stimulus-interval,
air was passed through the blank odour vial containing only DPG. Mice 
were not subject to this protocol on the day of imaging. 


Measuring changes in cortical odour representations after 
mixture exposure 

Cortical representations of all 22 odours of the tiled odour set were 
obtained after cessation of behavioural training. Single neuron and 
population representations of the target aldehydes (propanal and 
butanal) and target ketones (propanone and butanone) were compared 
to data obtained from odour-naive mice exposed to the tiled odour 
set. Off-target control comparisons were made between chain-length 
matched esters and acids (ethyl acetate and propyl acetate versus 
propanoic acid and butanoic acid) as well as between target ketones 
and off-target long-chain aldehydes (heptanal, octanal). 

For comparing changes in response profiles of individual neurons 
across the target classes, a class preference index was assigned to each 
neuron using ROC binary classification. The trial-averaged responses of 
each neuron were labelled as either aldehydes or ketones, resulting in 
a single auROC value reflecting the degree of discriminability between
classes. The class preference index was obtained by rescaling auROC
values from a range of 0 to 1 to a range of −1 to 1, with 0 representing low
discriminability. Because the class preference index combines both
the magnitude and frequency of responses, a neuron with weak preference
for a single class could be strongly and uniformly responsive or
non-responsive to the target odours. We therefore limited analysis to 
neurons exhibiting at least one response to any short-chain aldehyde or 
ketone of magnitude greater than 2 s.d. above the mean response across
all 22 presented odours. Changes to odour similarity at the population
level were assessed by comparing the correlation distance between 
odour pairs across classes. Because differences in the average magni- 
tude of ensemble correlations (which are not relevant to the pairwise 
restructuring we focus on herein) may uniformly bias comparisons 
between experimental conditions, before computing pairwise correla- 
tion distances (see ‘Chemical and activity distance’), the trial-averaged 
tuning profile of each neuron was initially centred on its mean.
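The class preference index amounts to a linear rescaling of auROC, index = 2 × auROC − 1. A minimal sketch is given below; the pairwise-comparison estimate of auROC, the sign convention (positive values for aldehyde preference) and the toy responses are assumptions made for illustration.

```python
import numpy as np

def auroc(pos, neg):
    """Probability that a `pos` value exceeds a `neg` value (ties count 0.5)."""
    pos = np.asarray(pos, dtype=float)[:, None]
    neg = np.asarray(neg, dtype=float)[None, :]
    return float(np.mean((pos > neg) + 0.5 * (pos == neg)))

def class_preference_index(aldehyde_resp, ketone_resp):
    """Rescale auROC from [0, 1] to [-1, 1]; 0 means no discriminability."""
    return 2.0 * auroc(aldehyde_resp, ketone_resp) - 1.0

# Trial-averaged responses of three hypothetical neurons to two odours per class
aldehyde_pref = class_preference_index([1.2, 0.9], [0.1, 0.2])
ketone_pref = class_preference_index([0.1, 0.2], [1.2, 0.9])
no_pref = class_preference_index([0.5, 0.5], [0.5, 0.5])
```

The third neuron shows why an additional response criterion is needed: an index of 0 arises whether the neuron responds uniformly to all target odours or to none of them.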


Descriptor relevance 

The chemical descriptors in Supplementary Table 1 identified by Lasso
optimization afford an algorithmic representation of chemical struc- 
ture. However, each descriptor incorporates some information about 
semantic molecular properties, such as molecular weight, electron- 
egativity, polarizability, ionization potential, molecular volume and 
hydrophobicity. The relevance of each descriptor, obtained from the 
(Dragon, KODE Inc.) website https://chm.kode-solutions.net/products_dragon_descriptors.php,
is presented next to the descriptor name.


Statistical tests 

For comparing two normal independent distributions, the Student’s 
t-test (two-sided) was used. For comparing two independent distribu- 
tions when normality cannot be assumed, significance was assessed by 
permutation testing or using the two-sided Wilcoxon rank sum test. 
The Kolmogorov-Smirnov test was used to determine equivalence
between two distributions. For testing significance of a single statistic
against null distributions obtained by permutation, the true value
was deemed significant if it resided outside the 5th–95th percentile
of the null statistic distribution. Error bars refer to 95% confidence
interval, s.e.m. or s.d. as indicated in the figure legends. For regression
interval, s.e.m. or s.d. as indicated in the figure legends. For regression 
modelling, confidence intervals are computed over bootstraps (with 
replacement) of the data. For establishing correspondence between 
two distance matrices, Pearson’s product moment correlation was used 
on the upper triangle of each set of measurements.
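Two of these procedures, the comparison of distance matrices and the percentile-based permutation criterion, can be combined in a short sketch. The null model used here (shuffling the odour labels of one matrix) is an assumption chosen for illustration, as are the helper names and the synthetic distance matrix.

```python
import numpy as np

def matrix_correlation(dist_a, dist_b):
    """Pearson's r between the upper triangles (diagonal excluded)."""
    iu = np.triu_indices_from(dist_a, k=1)
    return np.corrcoef(dist_a[iu], dist_b[iu])[0, 1]

def permutation_test(dist_a, dist_b, n_perm=1000, seed=0):
    """Flag the observed r if it lies outside the 5th-95th percentile of the null."""
    rng = np.random.default_rng(seed)
    observed = matrix_correlation(dist_a, dist_b)
    null = np.empty(n_perm)
    for p in range(n_perm):
        order = rng.permutation(dist_b.shape[0])     # shuffle odour labels (assumed null)
        null[p] = matrix_correlation(dist_a, dist_b[np.ix_(order, order)])
    lo, hi = np.percentile(null, [5, 95])
    return observed, bool(observed < lo or observed > hi)

rng = np.random.default_rng(1)
points = rng.normal(size=(22, 3))                    # 22 hypothetical odours
dist = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
observed, significant = permutation_test(dist, dist)
```

Only the upper triangle is used because a distance matrix is symmetric with a zero diagonal, so the remaining entries carry no additional information.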


Reporting summary 


Further information on research design is available in the Nature 
Research Reporting Summary linked to this paper. 


Data availability 


All data will be posted to GitHub or made available upon reasonable
request (www.github.com/dattalab). 


Code availability 


All code will be posted to GitHub or made available upon reasonable
request (www.github.com/dattalab). 
Extended Data Fig. 1 | Volumetric population imaging of PCx L2 and L3
during wakefulness using rationally designed odour sets. a, Left, cartoon of 
the volumetric multi-photon imaging approach used to characterize odour 
responses in PCx in wakeful, semi-paralysed mice (Methods). Right, 
approximate position of an imaging volume (green dotted line) in a typical
experiment superimposed on a Nissl-stained coronal section through PCx.
Scanning volumes were oriented to acquire similarly sized cortical populations 
in L2 and L3 (red dotted lines), despite decreased neuron density in L3
(Methods). Imaging was performed in the most anterior portion of the
posterior PCx. b, Sample fields of view for a single imaging session. PCx L2 is
depicted on top; PCx L3 on bottom. Segmentation masks associated with each
layer are shown on the right. c, Global, clustered and tiled odour sets 
superimposed on the collection of odours constituting odour space as defined
by principal components analysis (Methods). Global odours are indicated by 
black dots; tiled and clustered odour sets via the indicated colour code. d, Plot
of the amount of molecular variance contributed by each additional principal
component for each odour set in descriptor space; this analysis reveals that 
each odour set tiles odour space at a distinct level of resolution. e, Molecular 
structures and associated photoionization detector (PID) signals of the odours 
comprising the global, clustered and tiled odour sets. These PID traces are 
shown to illustrate the controlled kinetics of the olfactometer only; because 
detector reports depend on the ability of an odour to be photo-ionized, the 
relative amplitudes of the traces between odours are not meaningful. For 
example, heavy aliphatics elicit a minimal PID response because their photo- 
ionization energies lie outside the range of the detector; however, odours with 
low or absent PID traces still induced cortical activity in 5-20% of the imaged 
population, consistent with effective odour delivery. Five odours are shared 
between the global and clustered odour sets. These are indicated by bold 
lettering (and in c, as black circles with coloured edges). Colour code as in c.
Extended Data Fig. 2 | Odour responses in PCx are substantially altered by
anaesthesia. a, Left, EEG power spectral density plot from an individual 
subject depicting differences in cortical state between ketamine- 
medetomidine anaesthesia and wakefulness (Methods). Under anaesthesia, 
the EEG signal is enriched in the delta band (0.5-4 Hz) at the expense of high 
frequency (40-100 Hz) gamma oscillations; by contrast, gamma activity 
increases and delta activity decreases during wakefulness. Right, summary of 
differences in EEG power content expressed as delta/gamma ratio during 
anaesthesia and wakefulness averaged from four subjects. Error bars indicate 
s.e.m. b, Comparison of the fraction of responsive neurons (obtained from the
population of neurons that respond to at least one odour during
wakefulness) (Methods) to the tiled odour set in the same field of view
(obtained from PCx L2 and PCx L3) during the awake state and under 
anaesthesia. Responses are defined according to auROC analysis (Methods). 
Each dot represents a single odour (L2: 504 neurons, L3: 418 neurons). c, Top,
black trace represents heart rate (average over 10 s, non-overlapping windows)
recorded from an awake mouse in the home cage. Blue traces are example raw
heart rate (HR) signal indicating the range of heart rate fluctuations observed
during the awake state. The high variability in heart rates (which span
approximately 350 to 650 beats per min) reflects ongoing behaviour in the 
awake mouse. Bottom, as in the top panel, but for heart rate recorded during 
wakefulness and after induction of ketamine-medetomidine anaesthesia 
(Methods). Grey arrow indicates time of induction. Grey and red rectangles and 
associated inset traces are 20-s segments of real-time heart-rate signal. During 
wakefulness, fluctuations in heart rate remain within a physiologically normal 
range of 300-500 beats per min, without any detectable episodes of
tachycardia (Methods). Periodic large-amplitude dips in the recorded heart
rate during wakefulness reflect moments when pharmacological agents are 
being administered, which briefly interrupts the heart rate monitor. 
Extended Data Fig. 3 | PCx L3 neurons exhibit denser, broader and more
reliable odour responses than neurons in PCx L2. a, Examples of odour-evoked
excitation and suppression in PCx. Each panel corresponds to a single
cell-odour pair. Grey lines represent individual trials. Coloured overlays 
represent trial-mean activity. Shaded grey rectangles delimit the odour 
presentation period. b, Trial-averaged population response raster depicting 
odour-evoked activity in response to 22 odours (global odour set) across L2 
and L3. Responses are ΔF/F, with redder colours indicating excitatory
transients and bluer colours indicating odour-evoked suppression. x axis is 
time; double vertical bars delimit 2-s odour presentation periods. c, Response 
types observed in L2 and L3 (clustered odour set). Individual panels 
correspond to clusters identified using a Gaussian mixture model (Methods). 
Grey traces correspond to trial-averaged cell-odour pairs. Coloured overlays 
represent mean response time course associated with each cluster. Right, 
fraction of all cell-odour pairs exhibiting excitation or suppression. d, 
Response amplitudes of cell-odour pairs obtained from PCx L3 depicted on a
trial-by-trial basis. Each row represents the response of a given neuron to 10
consecutive presentations of the same odour. Neurons are sorted 
hierarchically using average linkage and correlation distance. Despite the 
presence of some habituation in response to several presentations of the same 
odorant across the experiment, habituation does not appear uniform across 
the neural population nor does it appear to dominate neural responses to 
odours. Different groups of neurons were identified with maximal responses 
to an odour peaking at different times across the experiment; see examples 
depicted on the right. Each row of traces corresponds to a single cell-odour
pair. e, At the population level, odour responses do not uniformly habituate
across the experiment. Top, cartoon depiction of procedure for determining
change in response amplitude over the course of the experiment for a single
cell-odour pair. Middle and bottom, pooled data for all cell-odour pairs, sorted
by layer. Red lines correspond to distribution means (clustered odour set). 

f, Lifetime sparseness distributions (used to quantify tuning breadth) 
(Methods) in L2 and L3 across all experiments (1 = perfectly odour selective,
0 = completely non-selective, *P < 0.01, permutation test on layer label).
Distributions are built using all responsive neurons (significant response to at 
least one odour by auROC analysis) pooled by layer across all experiments 
(here and throughout, global: n = 3 mice, L2 = 854 neurons, L3 = 616 neurons;
clustered: n = 3 mice, L2 = 867 neurons, L3 = 488 neurons; tiled: n = 3 mice,
L2 = 427 neurons, L3 = 334 neurons). g, Population sparseness distributions
(used to quantify response density) (Methods) in L2 and L3 (1 = few neurons
active overall, 0 = all neurons active overall to an equal level). *P < 0.01,
permutation test on layer label. h, Probability density distributions of 
coefficient of variation for all significant cell-odour pairs identified with 
auROC analysis. *P< 0.01, permutation test on layer label. i, Probability density 
distributions of ensemble correlations (that is, pairwise correlations between 
odour-evoked ensembles) between trial-averaged population odour responses 
in L2 (left) and L3 (middle). Dashed control curves indicate the distribution of 
ensemble correlations after shuffling odour labels independently across 
neurons. Ensemble correlations were determined independently for each 
mouse, and subsequently pooled. *P< 0.01, permutation test on odour label. L3 
exhibits greater correlations at the population level than L2 (right). *P< 0.01, 
permutation test on layer label. 
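
As a rough illustration of the sparseness scale quoted in f and g, the sketch below computes a normalized sparseness index that equals 1 for a perfectly selective response profile and 0 for a completely uniform one. It assumes the commonly used Vinje-Gallant form of the index; the definition actually used for these panels is given in the Methods and may differ. The function name and the toy response values are hypothetical.

```python
def sparseness(responses):
    """Normalized sparseness index in [0, 1] (assumed Vinje-Gallant form).

    responses: non-negative response amplitudes, e.g. one neuron's
    trial-averaged response to each odour (lifetime sparseness) or
    every neuron's response to one odour (population sparseness).
    """
    n = len(responses)
    if n < 2:
        raise ValueError("need at least two responses")
    mean_sq = (sum(responses) / n) ** 2          # (mean response)^2
    sq_mean = sum(r * r for r in responses) / n  # mean of squared responses
    if sq_mean == 0:
        raise ValueError("all responses are zero")
    return (1 - mean_sq / sq_mean) / (1 - 1 / n)


# Hypothetical toy profiles across four odours.
selective = sparseness([1.0, 0.0, 0.0, 0.0])  # responds to one odour only
broad = sparseness([1.0, 1.0, 1.0, 1.0])      # responds equally to all odours
```

With these toy inputs the selective profile sits at the top of the scale and the uniform profile at the bottom, matching the 1-to-0 convention stated in the caption.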
Extended Data Fig. 4 | Cortical odour representations are stable from trial
to trial and not chemotopically organized. a, Left, pairwise odour chemical 
correlation matrices for the global, clustered and tiled odour sets. Rows and 
columns are sorted according to the chemical similarity between odours as
assessed by hierarchical clustering (Methods). Middle and right, Pairwise 
correlation distances of single-trial, population representations for odours in 
the global, clustered, and tiled odour experiments in PCx L2 and L3 (and 
boutons for the tiled odour set). Rows and columns are sorted according to the 
chemical similarity between odours as on the left. Chemical colour code (x and
y axis labels of matrices, indicating functional group associated with each
group of molecules) is shown in the legend. R values indicate Pearson’s 
correlation to odour chemistry. b, Top, structured odour relationships persist 
from trial to trial over the course of the experiment. Blue line represents the 
similarity of two correlation distance matrices built from population 
responses obtained on consecutive trials. Grey dashed line indicates mean
across all trial-pair comparisons (10 trials, 9 trial pairs; clustered odour set, L3).
Bottom, chemistry-based odour relationships correspond to matched cortical
relationships obtained on a trial-by-trial basis. Dashed grey line represents the
similarity of chemical and neural activity distances on a trial-by-trial basis.

c, Correspondence between odour structure in PCx L3 (clustered odour set) 
and odour chemistry using three different distance metrics (correlation 
distances, Euclidean distances and cosine distances). Distance matrices 
calculated from population activity are obtained using instantaneous ΔF/F0
over 130 ms increments (F0: baseline fluorescence averaged over a 1-s sliding
window). Vertical lines delimit the 2-s odour presentation. d, Odour chemical
relationships emerge within a few hundred milliseconds after odour onset and
persist for several seconds after odour offset (see Extended Data Fig. 1e for
associated PID traces). e, Example PCx L2 and L3 FOVs from a single mouse with
each responsive neuron coloured according to its preferred odour in the
clustered odour set. Neurons preferring odours belonging to different classes
(legend) appear spatially intermingled in both L2 and L3. f, Contour plots of
pairwise signal correlations, plotted with respect to distance in L2 and L3 for
the clustered and tiled experiments. Darker colours indicate increased density
(see margin distributions). Pearson’s r is overlaid and indicates no spatial
organization of odour representations in PCx. 
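
A minimal sketch of the kind of comparison summarized by the R values in a: build correlation distances between odours from chemical descriptors and from population responses, then correlate the two sets of pairwise distances. The toy data, the function names and the use of plain Pearson correlation at both steps are assumptions for illustration, not the authors' pipeline.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def correlation_distances(patterns):
    """Upper-triangle correlation distances (1 - r) between all pattern pairs."""
    k = len(patterns)
    return [1 - pearson(patterns[i], patterns[j])
            for i in range(k) for j in range(i + 1, k)]


# Hypothetical toy data: 3 odours x 4 features (descriptors or neurons).
chemical = [[1, 2, 3, 4], [1, 2, 3, 5], [4, 3, 1, 1]]
neural = [[0.9, 0.1, 0.8, 0.2], [1.0, 0.2, 0.7, 0.1], [0.1, 0.9, 0.2, 0.8]]

# Agreement between chemical and neural odour relationships.
r_chem_neural = pearson(correlation_distances(chemical),
                        correlation_distances(neural))
```

In this toy case the first two odours are close in both spaces and the third is far from both, so the two distance sets agree closely; real descriptor and response matrices would be much larger.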
Extended Data Fig. 5 | Lasso optimization identifies parsimonious sets of
chemical descriptors that predict neural odour relationships. a, Left,
descriptors identified through training on one odour set also improve
Pearson’s correlation (r) between corresponding chemical and neural 
distances for held-out sets of odours. C, clustered; G, global; T, tiled. A value of 1 
in the matrix corresponds to no improvement from baseline Pearson’s r value 
after optimization. Baseline chemical-neural correlation is 0.22 for global; 0.48 
for clustered; 0.37 for tiled (see Supplementary Table 1 for optimal descriptor 
sets). Right, reduction in mean-squared error (MSE) between chemical and 
neural odour pair distances for held-out odour sets (indicated below the x axis) 
after training on a single odour set (indicated above). Note that the five odours
in common between the global and clustered odour sets (names in bold in 
Extended Data Fig. 2e) were discarded when evaluating performance on 
held-out data. The chemical features learned from the tiled odour set improved 
chemical-neural Pearson’s correlations in the clustered odour experiment but 
not the global odour experiment, consistent with the odours belonging to the 
tiled set covering only a limited region of chemical odour space (left). However,
despite the limited chemical overlap between the tiled and global odour sets,
training on the tiled odour set still improved the correspondence between
odour chemistry and neural responses for the global odour set as assessed by a
reduction in the mean-squared error (right). b, Identifying a subset of chemical
descriptors (from the original superset used to define odour space) using
Lasso optimization on odour distances improves the correspondence to
cortical activity (Methods, Supplementary Table 1). Training data were derived
from the bouton dataset, and testing was performed for bouton responses to
held-out odours within the tiled odour set, and also to cortical responses of the
tiled odour set. Data are mean ± s.e.m. over cross-validation folds. c, The same
procedure as in b was performed on a limited subset of 15 semantically relevant
descriptors that comprise the ‘molecular properties’ block of the Dragon 
database; these descriptors include metrics that reflect molecular properties 
associated with functional groups (for example, donor or acceptor atom 
surface area), molecular weight (for example, van der Waals molecular volume) 
or a combination of both, such as ‘hydrophilic factor’, and reflect the main axes
of diversity in the tiled odour set. Most descriptors enriched in the olfactory 
bulb covary with molecular weight (red descriptors). Most descriptors 
enriched in PCx reflect the combined presence of a charged atom and variable 
number of carbon atoms along the aliphatic series of the tiled odour set (blue 
descriptors). Note that these descriptors differ from those identified when 
querying the entire Dragon set using Lasso optimization (Supplementary 
Table 1), as this limited set of targeted descriptors (selected because their 
semantic meaning is transparent) may not afford optimal predictions over 
neural data. 


Extended Data Fig. 6| Functional imaging of OB axons in PCx via axonally 
targeted GCaMP6s. a, Left, whole-mount depicting Tbx21-Cre-dependent 
expression of AAV PHP.eB hSynapsin1-FLEX-axon-GCaMP6s in OB projection
neuron axons. GCaMP6s fluorescence is broadly distributed across piriform
cortex. Right, coronal sections depicting GCaMP6s signal (green) in the mitral
cell layer across the entire anterior-posterior extent of the olfactory bulb and
cortex. Inset, bottom, GCaMP6s-labelled axons shown coursing through PCx
L1a. Bottom left, en face image of L1a depicts dense and uniform distribution of
axonal boutons. b, Difference heat map of a typical field of view (FOV)
depicting baseline and odour-driven fluctuations in GCaMP6s signal. The
strongest activation (light colour) is associated with axonal boutons. c, Time-averaged
fluorescence signal of FOV in b. Overlay shows segmented ROIs
corresponding to axonal boutons depicting increases (red) or decreases (blue) 


in fluorescence, averaged over multiple presentations of a single odour from
the tiled odour set. d, Example average fluorescence from several boutons in a.
Grey bar indicates odour delivery period, scale bar indicates response 
amplitude. For clarity, fluorescence time courses for each example bouton are 
offset along the y axis. e, Example bouton responses for the tiled odour set. 
Each row represents the trial-averaged response of a single bouton for two
seconds during and after odour exposure (columns) depicted as z-scored
ΔF/F0; rows are sorted hierarchically using correlation distance and average
linkage. The functional group and carbon chain-length associated with each 
odour are indicated below each column; light-to-saturated gradient indicates 
progression from short-chain to long-chain odours. Note that, as has been 
observed previously for OB projection neurons, boutons exhibit a substantial 
amount of odour-driven suppression. 
Extended Data Fig. 7 | Bouton odour response properties. a, Probability
density distributions for boutons, PCx L2 and PCx L3 for signal correlations.

b, Left, as in a, but for ensemble correlations. Right, for the top 5% most similar
odour pairs identified in boutons, correlation for the same odour pairs in PCx.
Ensemble responses in both PCx L2 and PCx L3 exhibit stronger similarity than
boutons. c, d, Probability density distributions for boutons, PCx L2 and PCx L3,
for lifetime and population sparseness. e, Cumulative neural variance 
explained with increasing numbers of principal components, indicating 
relatively higher dimensionality in boutons compared to PCx (that is, more 
uniform distribution of variance across principal components). f, Probability 
density distributions for boutons, PCx L2 and PCx L3 for coefficient of 
variation representing trial-to-trial response variability across cell-odour 
pairs. These data demonstrate that observed odour responses in boutons are 
more reliable than similar responses in the cortex. For a-f, only the tiled odour 
set is used. For lifetime sparseness, 1 = perfectly odour selective, 0 = completely
non-selective. For population sparseness, 1 = few neurons responsive, 0 = all
neurons equally responsive. Distributions are built using all responsive
neurons/boutons (significant response to at least one odour by auROC
analysis; boutons: 3,160 ROIs across 6 subjects; PCx L2: 427 neurons across 3
subjects; PCx L3: 334 neurons across 3 subjects). Asterisk indicates significant
difference between boutons and either L2 or L3:a, vs L2 P< 10’; vs L3 P= 0.02; 
b, vs L2 P<10°; vs L3 P< 0.005; ¢, vsL2 P< 10°; vsL3 P= 0.93; d, vsL2 P<107 vs 
L3 P<10*;f, vsL2: P<10~°; vs L3: P< 10; two-sided Wilcoxon rank sum test for 
all comparisons. g, Single-trial z-scored ΔF/F0 for 1,000 boutons recorded in
PCx L1a during presentation of 22 odours belonging to the tiled odour set
indicated by black lines. Redder colours indicate excitatory transients, and
bluer colours indicate odour-evoked suppression. h, Response types observed 
in boutons (tiled odour set). Individual panels correspond to clusters identified 
using a Gaussian mixture model (Methods). Grey traces correspond to trial- 
averaged bouton-odour pairs. Coloured overlays represent mean response 
time course associated with each cluster. Blue vertical lines mark periods of 
odour presentation. i, Fraction of total odour-driven bouton variance in each 
individual mouse that can be attributed to the shared across-mouse structure 
as quantified by distance covariance analysis (Methods). 
Extended Data Fig. 8 | Habituation-dishabituation test for assessing
perceptual similarity of odour pairs. a, Left, mice presented with new odours
exhibit investigation that diminishes over several consecutive presentations of
the same odorant. Subsequent presentation of a perceptually different odour
reinstates investigation, and presentation of a similar odour has little effect. 
The extent to which two odorants are perceptually related is assessed by the 
magnitude of rekindled interest in the second odour after habituation has 
occurred to the first. b, Investigation times for two different odour triplets. 
Data are mean ± s.e.m. (n = 7 and n = 8 mice, respectively). After habituation to
heptanal, investigation of the closely related octanal (1-carbon difference) only 
modestly increases. Presentation of butanal after habituation to octanal 
(4-carbon difference) induces greater investigation. For the second triplet, 
presentation of heptanal after habituation to heptanone (0-carbon difference, 
different functional group) induces greater investigation, whereas subsequent 
presentation of octanal after habituation to heptanal (1-carbon difference, 
same functional group) induces much less investigation. 
Extended Data Fig. 9 | Inhibition of the associative network through
cell-autonomous expression of tetanus toxin light chain in excitatory PCx 
neurons. a, Uniform infection of excitatory pyramidal neurons in PCx L2 and 
L3 with AAV-hSyn-FLEX-TeLC-P2A-NLS-dTom in an Emx1-Cre mouse. b, Left, 
coronal section through PCx indicating placement of recording electrode. 
Right, single-unit odour-evoked activity (grand-average of all excitatory 
responses deemed as significant by auROC analysis) in Emx1-Cre mice
expressing TeLC or wild-type controls. Disruption of cortical recurrent 
excitation enhances odour-evoked excitation, consistent with disruption of 
feedback inhibition. Grey bar indicates odour presentation (n = 121 cell-odour
pairs from two Emx1-Cre mice expressing TeLC; n = 229 cell-odour pairs from
four mice). c-g, Probability density distributions for the TeLC experiment for 
signal and ensemble correlations, lifetime and population sparseness, and
coefficient of variation (constructed as in Extended Data Fig. 7, here only for 
the tiled odour set). For lifetime sparseness, 1 = perfectly odour selective,
0 = completely non-selective. For population sparseness, 1 = few neurons
responsive, 0 = all neurons equally responsive. Distributions are built using all
responsive neurons (significant response to at least one odour by auROC
analysis; TeLC L2: 435 neurons across 3 subjects; TeLC L3: 590 neurons across 3
subjects; PCx L2: 427 neurons across 3 subjects; PCx L3: 334 neurons across 3
subjects). Asterisk indicates TeLC is significantly different from PCx L2 or L3: c,
L2P<10°;L3P<10*; d,L2P<10*;L3 P<10~°;e, L2P<10°™;L3 P<10°;f, L2 
P<107L3P<10°; g,L2:P<10; L3: P< 10*; two-sided Wilcoxon rank sum test 
for all comparisons. 
obtained from odour-naive (same data as in Figs. 1-4) mice as well as mice
passively exposed to a target mixture of two short-chain aldehydes and two
short-chain ketones in the home cage (Methods, Fig. 4e, f). Passive experience
[...] short-chain acids with which mice had no previous experience (off-target
comparisons indicated in legend in black, naive: 334 neurons, n = 3 mice;
exposed: 742 neurons, n = 3 mice).
Reporting Summary


Nature Research wishes to improve the reproducibility of the work that we publish. This form provides structure for consistency and transparency 
in reporting. For further information on Nature Research policies, see Authors & Referees and the Editorial Policy Checklist. 


Statistical parameters 


When statistical analyses are reported, confirm that the following items are present in the relevant location (e.g. figure legend, table legend, main 
text, or Methods section). 


n/a | Confirmed 


The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement 


An indication of whether measurements were taken from distinct samples or whether the same sample was measured repeatedly 


The statistical test(s) used AND whether they are one- or two-sided 
Only common tests should be described solely by name; describe more complex techniques in the Methods section. 


A description of all covariates tested 


A description of any assumptions or corrections, such as tests of normality and adjustment for multiple comparisons 


A full description of the statistics including central tendency (e.g. means) or other basic estimates (e.g. regression coefficient) AND 
variation (e.g. standard deviation) or associated estimates of uncertainty (e.g. confidence intervals) 


For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted 
Give P values as exact values whenever suitable. 


For Bayesian analysis, information on the choice of priors and Markov chain Monte Carlo settings


For hierarchical and complex designs, identification of the appropriate level for tests and full reporting of outcomes 


Estimates of effect sizes (e.g. Cohen's d, Pearson's r), indicating how they were calculated 


Clearly defined error bars 
State explicitly what error bars represent (e.g. SD, SE, CI)


Our web collection on statistics for biologists may be useful. 


Software and code 


Policy information about availability of computer code 


Data collection Imaging data was acquired using ScanImage 5 by Vidrio. Extraction of cell fluorescence was performed using the open-source software
Suite2p. The odor library used for designing stimulus sets was obtained from www.thegoodscentscompany.com. Physicochemical
descriptors were calculated using Dragon 7.0, KODE Inc.


Data analysis Standard analyses were performed in Matlab and Python. The package "Pyrcca" (https://github.com/gallantlab/pyrcca) was used for 
multi-set canonical correlation analysis. Simulated annealing was performed using the open-source package simanneal 0.4.2
(https://pypi.org/project/simanneal/).


For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors/reviewers 
upon request. We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Research guidelines for submitting code & software for further information. 
Data


Policy information about availability of data 
All manuscripts must include a data availability statement. This statement should provide the following information, where applicable: 


- Accession codes, unique identifiers, or web links for publicly available datasets 
- A list of figures that have associated raw data 
- A description of any restrictions on data availability 


The data that support the findings of this study are available from the corresponding author upon reasonable request. 
[x] Life sciences   [ ] Behavioural & social sciences   [ ] Ecological, evolutionary & environmental sciences


For a reference copy of the document with all sections, see nature.com/authors/policies/ReportingSummary-flat.pdf 


Life sciences study design 


All studies must disclose on these points even when the disclosure is negative. 


Sample size Data collected from 18 animals was included in the study: each of three odor sets described in the study was presented to three animals such 
that each animal was exposed once to a single odor set for piriform imaging, the tiled odor set was presented to 3 additional mice for cortical 
imaging in the TeLC experiment, and six additional mice for the bouton imaging experiment. We found that due to the consistency in the 
structure of odor representations apparent across individuals, three animals sufficed for each condition. 


Data exclusions Data exclusion criteria are described in the Methods. Briefly, exclusion criteria were pre-established to select for experiments where imaging 
volumes spanning both Piriform cortical layers 2 and 3 (or in the case of the bouton experiment, Layer 1a) could be imaged continuously for at 
least 2.5 hours with minimal drift and motion artifacts, where a sufficient fraction of cortical neurons was labeled with GCAMP6s and where 
odor-evoked activity could be detected over the course of the entire imaging session. 


Replication The principal findings in this study were present in each individual and are reported in the main text. 
Randomization Not relevant. Animals were not assigned to different conditions based on whether they originated from the same breeding, litter, or housing
cage, as all mice were derived from different cages and different breedings. In addition, the consistency in the structure of odor
representations described in this study suggests that the observations are robust to these variables.


Blinding Not relevant. The main findings in this study are not contingent on blind comparison of two or more experimental conditions, except for the 
scoring of the behavioral experiment, which was carried out in a blinded fashion. 


Reporting for specific materials, systems and methods 


Materials & experimental systems (n/a | Involved in the study)
Unique biological materials
Antibodies
Eukaryotic cell lines
Palaeontology
Animals and other organisms
Human research participants

Methods (n/a | Involved in the study)
ChIP-seq
Flow cytometry
MRI-based neuroimaging


Animals and other organisms 


Policy information about studies involving animals; ARRIVE guidelines recommended for reporting animal research 


Laboratory animals Male 8-16 week old C57/BL6J mice; for cortical imaging, mice harboring the Vgat-ires-Cre (Jackson Stock No. 028862) and
ROSA26-LSL-TdTomato reporter alleles (Jackson Stock No. 007914) were used, and for bulb imaging, mice harboring the Tbet-Cre
allele (Jackson Stock 024507) were used.
Wild animals This study did not involve the use of wild animals. 


Field-collected samples This study did not involve samples collected from the field. 
Meiosis, although essential for reproduction, is also variable and error-prone: rates of
chromosome crossover vary among gametes, between the sexes, and among humans
of the same sex, and chromosome missegregation leads to abnormal chromosome
numbers (aneuploidy). To study diverse meiotic outcomes and how they covary
across chromosomes, gametes and humans, we developed Sperm-seq, a way of 
simultaneously analysing the genomes of thousands of individual sperm. Here we 
analyse the genomes of 31,228 human gametes from 20 sperm donors, identifying 
813,122 crossovers and 787 aneuploid chromosomes. Sperm donors had aneuploidy 
rates ranging from 0.01 to 0.05 aneuploidies per gamete; crossovers partially 
protected chromosomes from nondisjunction at the meiosis I cell division. Some 
chromosomes and donors underwent more-frequent nondisjunction during meiosis 
I, and others showed more meiosis II segregation failures. Sperm genomes also 
manifested many genomic anomalies that could not be explained by simple 
nondisjunction. Diverse recombination phenotypes—from crossover rates to 
crossover location and separation, a measure of crossover interference—covaried 
strongly across individuals and cells. Our results can be incorporated with earlier 
observations into a unified model in which a core mechanism, the variable physical 
compaction of meiotic chromosomes, generates interindividual and cell-to-cell 
variation in diverse meiotic phenotypes. 


One way to learn about human meiosis has been to study how genomes
are inherited across generations. Genotype data are available for millions
of people and thousands of families; crossover locations are
estimated from genomic segment sharing among relatives and from
linkage-disequilibrium patterns in populations. Although inheritance
studies sample only the few gametes per individual that generate
offspring, such analyses have revealed that average crossover
numbers and crossover locations associate with common variants at
many genomic loci.

Another powerful approach to studying meiosis is to directly visualize
meiotic processes in gametocytes, which has made it possible to see
that homologous chromosomes usually begin synapsis (their physical
connection) near their telomeres; to observe double-strand breaks,
a subset of which progress to crossovers, by monitoring proteins that
bind to such breaks; and to detect adverse meiotic outcomes, such
as chromosome missegregation. Studies based on such methods
have revealed much cell-to-cell variation in features such as the physical
compaction of meiotic chromosomes.

More recently, human meiotic phenotypes have been studied by
genotyping or sequencing up to 100 gametes from one person, demonstrating
that crossovers and aneuploidy can be ascertained from direct
analysis of gamete genomes. Despite these advances, it has not yet
been possible to measure multiple meiotic phenotypes genome-wide
in many individual gametes from many people.


Development of Sperm-seq 


We developed a method (‘Sperm-seq’) with which to sequence thousands
of sperm genomes quickly and simultaneously (Fig. 1). A key challenge
in developing Sperm-seq was to deliver thousands of molecularly
accessible-but-intact sperm genomes to individual nanolitre-scale
droplets in solution. Tightly compacted sperm genomes are difficult
to access enzymatically without loss of their DNA into solution; we
accomplished this by decondensing sperm nuclei using reagents that
mimic the molecules with which the egg gently unpacks the sperm pronucleus
(Extended Data Fig. 1a-d). We then encapsulated these sperm
DNA ‘florets’ into droplets together with beads that delivered unique
DNA barcodes for incorporation into the genomic DNA of each sperm;
we modified three technologies to do this (Drop-seq, 10x Chromium
Single Cell DNA, and 10x GemCode, the last of which was used to
generate the data in this study) (Extended Data Fig. 1e, f). We then developed,
adapted and integrated computational methods for determining
the chromosomal phase of the sequence variants of each donor and for
inferring the ploidy and crossovers of each chromosome in each cell.
We used this combination of molecular and computational 
approaches to analyse 31,228 sperm cells from 20 sperm donors 
(974-2,274 gametes per donor), sequencing a median of roughly 1% 
of the haploid genome of each cell (Extended Data Table 1). Deeper 
sequencing allows detection of roughly 10% of a gamete’s genome. 
Fig. 1 | Overview of Sperm-seq. Schematic showing our droplet-based single-sperm sequencing method.


Sperm-seq enabled us to infer the haplotypes of donors along the full 
length of every chromosome: alleles from the same parental chromo- 
some tend to appear in the same gametes, so the coappearance pat- 
terns of alleles across many sperm enabled us to assemble alleles into 
chromosome-length haplotypes (Extended Data Fig. 2a and Methods). 
In silico simulations and comparisons with kilobase-scale haplotypes 
from population-based analyses indicated that Sperm-seq assigned 
alleles to haplotypes with 97.5-100% accuracy (Extended Data Fig. 2b,c 
and Supplementary Notes). 

The phased haplotypes determined by Sperm-seq allowed us to identify
cell ‘doublets’ from the presence of both parental haplotypes at
loci on multiple chromosomes (Extended Data Fig. 2d-f and Methods).
We also identified surprising ‘bead doublets’, in which two beads’ bar- 
codes reported identical haplotypes genome-wide through different 
single-nucleotide polymorphisms (SNPs), and thus appeared to have 
been incorporated into the same gamete genome (Extended Data 
Fig. 3a, b, Methods and Supplementary Methods). Bead doublets were 
useful for evaluating the replicability of Sperm-seq data and analyses 
(Extended Data Fig. 3c-e), which is usually impossible to do in inher- 
ently destructive single-cell sequencing. 
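
The cell-doublet idea can be sketched as follows. A single haploid gamete should carry one parental haplotype over long stretches of each chromosome, changing only at crossovers, whereas a droplet holding two cells tends to show both haplotypes intermixed on several chromosomes. The switch-counting heuristic, the thresholds and the data layout below are assumptions made for illustration; the criteria actually used are described in the Methods.

```python
def count_switches(hap_calls):
    """Number of haplotype changes between consecutive observed SNPs."""
    return sum(1 for a, b in zip(hap_calls, hap_calls[1:]) if a != b)


def looks_like_doublet(cell, max_switches=4, min_chromosomes=2):
    """Flag a barcode whose chromosomes show intermixed parental haplotypes.

    cell: dict mapping chromosome name -> list of haplotype labels (1 or 2)
          for the observed heterozygous SNPs, ordered by position.
    Thresholds are illustrative and not taken from the paper.
    """
    mixed = [chrom for chrom, calls in cell.items()
             if count_switches(calls) > max_switches]
    return len(mixed) >= min_chromosomes


# Hypothetical barcodes: one clean gamete and one with mixed haplotypes.
singlet = {"chr1": [1, 1, 1, 1, 2, 2, 2, 2], "chr2": [2, 2, 2, 2, 2, 2, 2, 2]}
doublet = {"chr1": [1, 2, 1, 2, 2, 1, 2, 1], "chr2": [2, 1, 1, 2, 1, 2, 1, 2]}

singlet_flag = looks_like_doublet(singlet)
doublet_flag = looks_like_doublet(doublet)
```

Requiring mixed signal on more than one chromosome, as the text describes, keeps a single noisy chromosome from triggering the flag.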


Recombination rate in sperm donors and cells 


We identified crossover (recombination) events in each cell as transi- 
tions between the parental haplotypes we had inferred analytically 
(Methods). We identified 813,122 crossovers in the 31,228 gamete 
genomes (Extended Data Table 1). Crossover locations were inferred 
with a median resolution of 240 kilobases (kb), with 9,746 (1.2%) inferred
within 10 kb (Extended Data Table 1 and Supplementary Notes). Analysis
of bead doublets indicated high accuracy of crossover inferences
(Extended Data Fig. 3e). Estimates of crossover rate and location were 
robust to downsampling to the same coverage in each cell (Extended 
Data Fig. 4 and Supplementary Methods). 
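
As a simplified sketch of calling crossovers as transitions between parental haplotypes, the function below scans the haplotype labels observed at heterozygous SNPs along one chromosome and reports each interval in which the label changes. The interval between the two flanking SNPs is the resolution of that call. The names and coordinates are hypothetical, and the sketch ignores genotyping errors, which a real pipeline (see Methods) has to handle.

```python
def call_crossovers(positions, hap_calls):
    """Return (left_bp, right_bp) intervals where the haplotype changes.

    positions: SNP coordinates in base pairs, sorted ascending.
    hap_calls: parental haplotype label (1 or 2) observed at each SNP.
    Each interval is bounded by the two flanking informative SNPs, so
    its width is the resolution of that crossover call.
    """
    if len(positions) != len(hap_calls):
        raise ValueError("positions and hap_calls must have equal length")
    crossovers = []
    for i in range(1, len(hap_calls)):
        if hap_calls[i] != hap_calls[i - 1]:
            crossovers.append((positions[i - 1], positions[i]))
    return crossovers


# Hypothetical sparse coverage of one chromosome in one sperm cell.
positions = [1_200_000, 5_400_000, 9_900_000, 10_140_000, 30_000_000, 61_500_000]
hap_calls = [1, 1, 1, 2, 2, 1]

events = call_crossovers(positions, hap_calls)
widths_kb = [(right - left) / 1000 for left, right in events]
```

Because each cell is covered sparsely, interval widths vary widely from one crossover to the next, which is why the text reports a median resolution rather than a single figure.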

The recombination rates of the 20 sperm donors ranged from 22.2
to 28.1 crossovers per cell. This is consistent with estimates from other
methods but with far more precision at the individual-donor
level (95% confidence intervals of 22.0-22.4 to 27.9-28.4 crossovers
per cell) owing to the large number of gametes analysed per donor 
(Extended Data Table 1 and Extended Data Fig. 5a). Individuals with 
higher global crossover rates had more crossovers on average on each 
chromosome (Extended Data Fig. 5b). We generated genetic maps for 
each of the donors from their 25,839-62,110 observed crossovers; 
these maps were broadly concordant with a family-derived paternal 
genetic map® (Extended Data Fig. 5c, d, Supplementary Notes and 
Supplementary Methods). 
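
To see why many gametes per donor give tight per-donor intervals, a donor's mean crossover rate and an approximate 95% confidence interval can be sketched as follows. The normal approximation and the function name are assumptions of this example; the interval method actually used is not stated in this section.

```python
import math
import statistics

def crossover_rate_ci(counts, z=1.96):
    """Mean crossovers per cell with a normal-approximation 95% interval.

    counts: per-cell crossover counts for one donor (at least two cells).
    """
    mean = statistics.fmean(counts)
    sem = statistics.stdev(counts) / math.sqrt(len(counts))
    return mean, mean - z * sem, mean + z * sem
```

With a per-cell standard deviation near 4.23 and roughly 1,500 cells per donor (31,228 gametes over 20 donors, on average), the half-width is about 1.96 × 4.23 / √1,500 ≈ 0.21 crossovers per cell, in line with the reported intervals.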

Much more variation was present at the single-cell level: cells rou- 
tinely contained 17 to 37 crossovers (1st and 99th percentiles, median 
across donors), with a standard deviation of 4.23 across cells (median 
across donors), versus a standard deviation of 1.53 across donors’ cross- 
over rates. Among gametes from the same donor, gametes with fewer 
crossovers in half of their genome tended to have fewer crossovers in 
the other half of their genome (Pearson’s r= 0.09, two-sided P=8 x 10“ 
with all gametes from all donors combined after within-donor nor- 
malization) (Supplementary Notes). This relationship, predicted by 
earlier observations in families’ and spermatocytes”, suggests that
the crossover number on each chromosome is partly shaped by factors
that act nucleus-wide. 


Crossover location and interference 


All 20 donors shared a tendency to concentrate their crossovers in the
same regions of the genome, with large concentrations of crossovers 
in distal regions, as expected from earlier analyses of families*®?"”°, 
and more modest shared enrichments in many centromere-proximal 
regions (Fig. 2a and Extended Data Fig. 6). Guided by these empiri- 
cal patterns, we divided the genome into ‘crossover zones’, each 
bounded by local minima in crossover density (Extended Data Fig. 6b 
and Supplementary Methods). These zones are much larger-scale 
than fine-scale-sequence-driven crossover hotspots’” **, which the 
spatial resolution of most crossover inferences was not well-suited 
for analysing. 

The crossover zones with the most variable usage across people were 
all adjacent to centromeres; individuals with high recombination rates 
used these zones much more frequently (Fig. 2a and Extended Data 
Fig. 6a; with simulated equal SNP coverage, Extended Data Fig. 4c, e). 
The relative usage of distal and proximal zones varied greatly among 
donors and correlated with donors’ recombination rates (Extended 
Data Fig. 7). These results were robust to alternative definitions of 
‘distal’ versus ‘proximal’ (Extended Data Fig. 7c and Supplementary Notes). 

Positive crossover interference causes crossovers in the same meiosis 
to be further apart than they would be if crossovers were independ- 
ent events”°?°***5, The effect of crossover interference was visible in 
each of the 20 sperm donors (Extended Data Fig. 8 and Supplementary 
Methods). Crossover separation varied greatly among sperm donors 
and correlated inversely with recombination rate (Extended Data 
Fig. 7b)—results that were robust to chromosome composition and 
that applied similarly to same-arm and opposite-arm crossover pairs 
(Extended Data Fig. 7e, f and Supplementary Notes).

The extremely strong correlations of donors’ crossover rates with 
crossover locations and interference could arise from an underlying 
biological factor that coordinates these phenotypes, or could arise 
trivially from the fact that chromosomes with more crossovers would 
also tend to have crossovers more closely spaced and in more regions. 
To distinguish between these possibilities, we focused on data from 
the 180,738 chromosomes with exactly two crossovers (here called 
‘two-crossover chromosomes’) (Supplementary Notes). Even in this 
two-crossover chromosome analysis, distal-zone usage (Fig. 2b) and 
crossover separation (Fig. 2c) correlated strongly and negatively with 
genome-wide recombination rate (additional control analyses are 
described in the Supplementary Notes and Extended Data Fig. 7d, g, h).
These relationships indicate that a donor’s crossover-location and 
crossover-spacing phenotypes reflect underlying biological factors 
that vary from person to person, as opposed to resulting indirectly 
from the number of crossovers on a chromosome.
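
The two-crossover restriction can be sketched as follows: only chromosomes with exactly two crossovers are kept, and separation is expressed as a fraction of chromosome length. The function name, the input layout and the chromosome lengths in the usage example are hypothetical.

```python
def two_crossover_separations(crossovers_by_chrom, chrom_lengths):
    """Separation (fraction of chromosome length) on two-crossover chromosomes.

    crossovers_by_chrom: {chromosome: [crossover positions in bp]} for one cell.
    chrom_lengths: {chromosome: length in bp}.
    """
    separations = {}
    for chrom, positions in crossovers_by_chrom.items():
        if len(positions) == 2:  # ignore chromosomes with 0, 1 or 3+ crossovers
            first, second = sorted(positions)
            separations[chrom] = (second - first) / chrom_lengths[chrom]
    return separations

# Hypothetical cell: only chr5 qualifies as a two-crossover chromosome.
example = two_crossover_separations(
    {"chr5": [20_000_000, 120_000_000], "chr13": [50_000_000]},
    {"chr5": 200_000_000, "chr13": 100_000_000},
)
```

Holding the crossover count fixed at two removes the trivial effect of crowding, so any remaining link between spacing and a donor's genome-wide rate cannot be attributed to the count on that chromosome.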

To test whether this covariation of diverse meiotic phenotypes also 
governs variation at the single-gamete level, we investigated whether 
cells with more crossovers than the average for their donor also 
exhibit the same kinds of crossover-spacing and crossover-location 

Fig. 2 | Variation in crossover positioning and crossover separation
(interference). Colours indicate the crossover rate of donor or cell (blue, low; 
red, high). a, Crossover location density plots for two chromosomes (5 and 13) 
from each donor (n=20). Dashed grey vertical lines show boundaries between 
crossover zones. Mb, megabases. b-e, Crossover positioning and separation 
(interference) on chromosomes with two crossovers. b, c, Interindividual
variation among n=20 sperm donors. Error bars show 95% confidence intervals.
b, Left, per-cell proportion of crossovers in the most distal crossover zones
(Kruskal-Wallis chi-squared = 1,034; df =19; P=2 107°”). Right, mean 
crossover rate versus the proportion of all crossovers (on two-crossover 
chromosomes) occurring in distal zones (Pearson’s r=—0.95; two-sided 
P=8x10").c, Left, density plot of separation between consecutive crossovers 
(Kruskal-Wallis chi-squared = 1,792; df =19; P<10°°°). Right, mean crossover 
rate versus median crossover separation on two-crossover chromosomes 
(Pearson's r=—0.95; two-sided P=7 x10"). d, e, Among-cell covariation of 
crossover rate with distal zone use (d) or crossover interference (e). 
Phenotypes are analysed as percentiles relative to sperm from the same donor. 
Box plots: midpoints, medians; boxes, 25th and 75th percentiles; whiskers, 
minima and maxima. d, Single-cell distal-zone use (the proportion of 
crossovers on two-crossover chromosomes that are in the most distal zones)
versus crossover rate (n cells per decile = 3,152, 3,080 and 3,101 for first, fifth 
and tenth deciles, respectively; Mann-Whitney W=5,271,934.5; two-sided 
P=2%x10~ between first and tenth deciles). e, Single-cell crossover separation 
(the median of all fractions of a chromosome separating consecutive
two-crossover chromosome crossovers in each cell) versus crossover rate 
(Mann-Whitney W=148,548,161, two-sided P=3 x 10° between first (n=11,658) 
and tenth (n= 23,154) deciles; all intercrossover separations used in test). 


phenotypes that donors with high crossover rates do (Supplemen- 
tary Methods). Indeed, two-crossover chromosomes from cells with 
more crossovers tended to have closer crossover spacing and increased 
relative use of non-distal zones (Fig. 2d, e and Extended Data Fig. 7i, j; 
unnormalized results are in the Supplementary Notes). This result 
indicates that the correlated meiotic-outcome biases that distinguish 
people from one another also distinguish the gametes within each 
individual (see Discussion). 


Chromosome and sperm donor aneuploidy 


Aneuploidy generally arises from a chromosome missegregation that 
yields two aneuploid cells: one in which that chromosome is absent 
(a loss), and one in which it is present in two copies (a gain). Among
the 31,228 gametes, we found 787 whole-chromosome aneuploidies 
and 133 chromosome arm-scale gains and losses (2.5% and 0.4% of 
cells, respectively) (Fig. 3a and Methods). All chromosomes and 
sperm donors were affected. The sex chromosomes and acrocen- 
tric chromosomes had the highest rates of aneuploidy, consistent 
with estimates based on fluorescence in situ hybridization analysis of 
chromosomes”? (Fig. 3b). 

The 20 young (18-38-year-old) sperm donors, considered by clinical 
criteria to have normal-range sperm parameters, exhibited aneuploidy 
frequencies ranging from 0.010 to 0.046 aneuploidy events per cell 
(Fig. 3c and Extended Data Table 1). Permutation tests indicated that this 
4.5-fold variation in observed aneuploidy rates reflected genuine inter- 
individual variation (one-sided P< 0.0001) (Supplementary Notes). 

Under the prevailing model for the origins of aneuploidy, sperm with 
chromosome losses and gains should be equally common. However, we 
observed 2.4-fold more chromosome losses than chromosome gains 
(554 losses versus 233 gains; proportion test two-sided P=2 10°). 
This asymmetry did not appear to reflect technical ascertainment bias 
(Extended Data Fig. 9a and Supplementary Notes). This result is con- 
sidered further in the Supplementary Discussion. 
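
The size of the loss/gain asymmetry can be checked directly from the reported counts. The sketch below uses a normal approximation to a test of equal proportions; that choice and the function name are assumptions of this example, and the exact test settings used in the paper are not given in this section.

```python
import math

def loss_gain_asymmetry(losses, gains):
    """Return total events, the loss:gain ratio and a z-score against 50:50."""
    total = losses + gains
    ratio = losses / gains
    # Under equal expectation, losses ~ Binomial(total, 0.5).
    z = (losses - total / 2) / math.sqrt(total * 0.25)
    return total, ratio, z

total, ratio, z = loss_gain_asymmetry(554, 233)
```

The two counts sum to the 787 whole-chromosome aneuploidies reported above, and their ratio rounds to the stated 2.4-fold excess of losses.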

Errors in chromosome segregation can occur at meiosis I, when 
homologues generally separate, or at meiosis II, when sister chro- 
matids separate. Because recombination occurs in meiosis I before 
disjunction but does not occur at centromeres, errors during meiosis 
I result in chromosomes with different (homologous) haplotypes at 
their centromeres, whereas sister chromatids nondisjoined in meiosis 
II have the same (sister) haplotype at their centromeres (Fig. 3a). (Sex 
chromosomes X and Y disjoin in meiosis I, and the sister chromatids of 
X and Y disjoin at meiosis II.) Encouragingly, for chromosome 21—the 
principal chromosome for which earlier estimates were possible—our 
finding of 33% meiosis I events and 67% meiosis II events matched previous
estimates from trisomy 21 patients with paternal-origin gains*.
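
The centromere rule described above can be written as a short classifier. This is a minimal sketch; the function names and the haplotype labels are hypothetical, and real data would first require confident haplotype calls flanking the centromere.

```python
def classify_autosomal_gain(centromere_hap_copy1, centromere_hap_copy2):
    """Assign a gained autosome to the meiotic division in which it arose.

    Different (homologous) haplotypes at the centromere indicate meiosis I;
    identical (sister) haplotypes indicate meiosis II.
    """
    if centromere_hap_copy1 != centromere_hap_copy2:
        return "meiosis I"
    return "meiosis II"

def classify_sex_chromosome_gain(sex_chromosomes):
    """X and Y together indicate meiosis I; XX or YY indicates meiosis II."""
    return "meiosis I" if set(sex_chromosomes) == {"X", "Y"} else "meiosis II"
```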

Across all chromosomes, meiosis I gains and meiosis II gains had very 
different relative frequencies in different individuals and on different 
chromosomes (Fig. 3d, e). For example, sex chromosomes were 2.2 
times more likely to be affected in meiosis I than meiosis II, whereas 
autosomes were 2.0 times more likely to be affected in meiosis II than 
meiosis I (proportion test two-sided P=1.3 x 10). The lack of correla- 
tion between meiosis I and meiosis II vulnerabilities (Fig. 3d, e) indi- 
cated that meiosis I and II are differentially challenging to different 
chromosomes and to different people. 

Although crossovers are required for proper chromosomal segre- 
gation” and seem to be protective against nondisjunction in mater- 
nal meiosis, in which chromosomes are maintained in diplotene of 
meiosis I for decades®, the relationship of crossovers to aneuploidy 
is less clear in paternal meiosis”**°***. We found that chromosome 
gains originating in meiosis I, when recombination occurs, had 36%
fewer total crossovers than matched, well-segregated chromosomes 
did (Supplementary Methods), suggesting that crossovers protected 
against meiosis I nondisjunction of the chromosomes on which they 
occurred (Extended Data Fig. 9b and Supplementary Notes). No similar 

Fig. 3 | Aneuploidy in sperm from 20 sperm donors. a, Example chromosomal
ploidy analyses. Thick dark grey line, DNA copy number measurement 
(normalized sequence coverage in 1-Mb bins); blue (haplotype 1) and yellow 
(haplotype 2) vertical lines, observed heterozygous SNP alleles, plotted with 
90% transparency; grey vertical boxes, centromeres (based on the reference
hg38 human genome). b-e, Frequencies (number of events divided by number 
of cells) of various aneuploidy categories. b, d, n=23 chromosomes; c, e, n=20
donors. Error bars are 95% binomial confidence intervals. b, Frequencies of 
whole-chromosome losses versus gains for each chromosome (excluding XY 
chromosomes, Pearson’s r= 0.88, two-sided P=7 x 10°°; including XY 
chromosomes (inset), Pearson’s r=0.99, two-sided P<10°°). c, 


relationship was observed for meiosis II gains (although the simulated 
control distribution for meiosis II is inherently less accurate; Supplementary
Notes) or at other levels of aggregation (Extended Data Fig. 9b-d
and Supplementary Notes). 


Other chromosome-scale genomic anomalies 


Many sperm had complex patterns of aneuploidy that could not be 
explained by the canonical single-chromosome missegregation event. 
We detected 19 gametes that had three, instead of one, copies of entire 
or nearly entire chromosomes (2, 15, 20 and 21; Fig. 3f and Extended Data 
Fig. 10a, b). Chromosome 15 was particularly likely to be present in two 
extra copies; in fact, sperm with three copies of all or most of chromo- 
some 15 (n=10) outnumbered sperm with two copies of chromosome 

Per-sperm-donor aneuploidy rates (excluding XY (not shown), Pearson’s
r=0.51, two-sided P= 0.02; including XY, Pearson’s r= 0.62, two-sided 
P=0.003). d, Frequencies of whole-chromosome gains occurring during 
meiosis I versus meiosis II for each chromosome (excluding XY, Pearson’s 
r=0.32, two-sided P= 0.15; including XY (inset), Pearson’s r= 0.85, two-sided 
P=3%x107).e, Frequencies of whole-chromosome gains occurring during 
meiosis I versus II for each donor (excluding XY (not shown), Pearson’s r= 0.06, 
two-sided P= 0.80; including XY, Pearson’s r= 0.17, two-sided P=0.47). f,
Example genomic anomalies detected in sperm cells, plotted as in a. NC2, NC4,
NC9 and NC22 signify individual sperm donors; cells are identified by 14-bp 
DNA barcode sequences. 


15 (n = 2) (Fisher’s exact test versus Poisson two-sided P = 2 x 10°’) 
(Supplementary Notes). 

Other gametes carried anomalies encompassing incomplete chromo- 
somes. These included: one cell that gained the p arm of chromosome 
4 while losing the q arm; cells with gains of two copies of a chromosome
arm; and cells with losses of chromosome arms (Fig. 3f and Extended
Data Fig. 10c, d). One cell carried at least eight copies of most of the q
arm of chromosome 4 (Fig. 3f). This gamete—which we estimate con- 
tained almost a billion base pairs of extra DNA—carried both parental 
haplotypes of chromosome 4, though almost all of the roughly eight 
copies came from just one of the parental haplotypes (93% of observed 
alleles in the amplified region were haplotype 2). It is likely that diverse
mutational processes generate these genomic anomalies (Supplemen- 
tary Discussion). 


Discussion 


Interindividual variation in crossover rates has previously been inferred 
from SNP data from families?’ ¥. Here, highly parallel single-gamete 
sequencing has revealed that sperm donors with high crossover rates also 
exhibit closer crossover spacing, even when controlling for the number of
crossovers actually made on a chromosome. On the basis of these analyses,
we consider it most likely that interindividual variation in crossover
interference is the true driver of variation in crossover rate and placement. 

These same constellations of correlated meiotic crossover pheno- 
types—low interference, high rates and use of centromere-proximal 
zones—tended to characterize the same gametes from any donor. Cells 
with more crossovers in half of their genome tended to have more 
crossovers in the other half, to have made consecutive pairs of crossovers
closer together in genomic distance (even when making just two
crossovers on a chromosome) and to have placed proportionally more
of their crossovers in nondistal chromosomal regions. 

We considered what could cause these meiotic phenotypes to covary 
across chromosomes, in individual cells, and among people. The physi- 
cal length of chromosomes during meiosis, which reflects their com- 
paction, has been observed to vary up to twofold among individual 
spermatocytes while being strongly correlated across chromosomes 
in the same spermatocyte; spermatocytes with more-compacted 
chromosomes also generally have fewer incipient crossovers?” 
A unifying model (Extended Data Fig. 11) explains the covariance of 
these meiotic phenotypes while providing a candidate mechanism for 
interindividual variation: cell-to-cell variation in the compaction of 
meiotic chromosomes—and person-to-person variation in the average 
degree of this compaction—would cause these phenotypes to covary 
in the manner observed in Fig. 2b-e. 

Our enthusiasm for this model relies on several additional earlier 
observations (Extended Data Fig. 11). First, at a cellular level, crossover
interference occurs as a function of physical (micrometre-scale) dis- 
tance along the meiotic chromosome axis or synaptonemal complex, 
rather than as a function of genomic (base-pair) distance** *. Second,
the first crossover on a chromosome is more likely to occur distally».
Such a model also predicts a shared mechanism for sex differences in
recombination rates and interindividual variation among individu- 
als of the same sex: oocytes have a longer synaptonemal complex, 
more crossovers and decreased crossover interference (as meas- 
ured in genomic distances) than spermatocytes, but have the same 
synaptonemal-complex length extent of crossover interference”*?*°*”, 

Human genetics research has revealed that recombination phe- 
notypes are heritable and associate with common variants at many 
genomic loci? *"”. A recent genome-wide association study found 
that variation in crossover rate and placement is associated with vari- 
ants near genes that encode components of the synaptonemal com- 
plex, which connects and compacts meiotic chromosomes, and with 
genes involved in the looping of homologues along the chromosome 
axis’. Our model predicts that inherited genetic variation at these loci 
may bias the average degree of compaction of meiotic chromosomes; 
the fact that this same property varies among cells from the same 
donor”°”' shows that variance is well-tolerated and compatible with 
diverse-but-successful meiotic outcomes. 

The sharing of covarying phenotypes between the single-cell 
and person-to-person levels suggests that a core biological mecha- 
nism shapes both inter- and intra-individual (single-cell) variation 
in meiotic outcomes. Such parallelisms between cell-biological and 
human-biological variation could in principle exist in a wide variety 
of biological contexts. 


Methods


A companion protocol for generating single-sperm libraries using the
methods presented here is available via Protocol Exchange**. Custom 
scripts (available via Zenodo”’) are referenced by name in the Methods
sections describing the relevant analyses. Recombination and ane- 
uploidy data generated by the methods described are also publicly 
available°°. All statistical analyses were performed in R unless otherwise 
noted. Details of further analysis methods are provided in the Sup- 
plementary Methods. 


Sample information 

Sperm samples from 20 anonymous, karyotypically normal sperm 
donors were obtained from New England Cryogenic Center under a 
‘not human subjects’ determination from the Harvard Faculty of Medi- 
cine Office of Human Research Administration (protocols M23743-101 
and IRB16-0834). Donors consented at the time of initial donation for 
samples to be used for research purposes. The ‘not human subjects’ 
determination was based on the use of discarded biospecimens for which
research consent had been obtained, and on the fact that researchers had 
no interactions with the biospecimen donors and no access to identifi- 
able information about the biospecimens. The reviewing committee 
also reviewed and approved our deposition of the data into a National 
Institutes of Health (NIH) repository. All experiments were performed in
accordance with all relevant guidelines and regulations. (Specimens can 
be obtained from the New England Cryogenic Center upon Institutional 
Review Board (IRB) approval.) No statistical methods were used to pre- 
determine sample size. As no conditions or experimental groups were 
analysed for this study, no randomization or blinding was performed. 

Samples arrived in liquid nitrogen in ‘egg yolk buffer’ or ‘standard 
buffer with glycerol’ (no further buffer information provided), and were 
aliquoted and stored in liquid nitrogen in the same buffers. 

Per sperm-bank policy, donors are 18-38 years old at the time of 
donation and the precise age of donors is not released. Donor identifiers
used here were created specifically for this study and are not linked to 
any external identifiers. 


ddPCR to evaluate genome accessibility 

To evaluate how often regions from two different chromosomes 
co-occurred (as would be expected from cells), we performed droplet 
digital polymerase chain reaction (ddPCR) with naked DNA, untreated 
sperm cells or sperm cells decondensed as described below but with 
variable heat incubation times. For each assay targeting each chromosome,
we created a 20x assay mix by combining 25.2 µl of 100 µM
forward primer (from IDT), 25.2 µl of 100 µM reverse primer (IDT) and
7 µl of 100 µM probe (IDT for fluorescein amidite (FAM)-labelled probes;
Life Technologies for VIC-labelled probes) with 82.6 µl ultrapure water.
We carried out ddPCR as previously described™, following section 3.2
steps 4-12, but with untreated sperm or sperm DNA florets as input 
instead of DNA. 
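
As a quick check on the 20x assay mix, the final concentration of each component follows from simple dilution arithmetic. The sketch below takes the stock volumes in microlitres and the stock concentrations in micromolar; the function and variable names are illustrative.

```python
def assay_mix(stocks, water_ul):
    """Total volume (µl) and final concentration (µM) of each component.

    stocks: {name: (volume_ul, stock_concentration_uM)}.
    """
    total_ul = sum(volume for volume, _ in stocks.values()) + water_ul
    final_um = {name: volume * conc / total_ul
                for name, (volume, conc) in stocks.items()}
    return total_ul, final_um

total_ul, final_um = assay_mix(
    {"forward primer": (25.2, 100),
     "reverse primer": (25.2, 100),
     "probe": (7, 100)},
    water_ul=82.6,
)
```

The volumes sum to 140 µl, giving about 18 µM of each primer and 5 µM of probe in the 20x mix, which corresponds to roughly 900 nM and 250 nM at 1x.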

For this analysis, we targeted chromosome 7 with an assay directed 
to intergenic region chr7:106552149-106552176 (hg38): forward 
primer sequence CGTAATGGGGCACAGGGATATA; reverse primer 
sequence CTGTGAGAGGTAGAGAATCGCC; probe sequence CAC 
AGAGTCCATTTGCAGCACCTCAGT; probe fluorophore FAM. We 
targeted chromosome 10 with an assay for the RPP30 gene at 
chr10:92631759-92631820: forward primer sequence GATTTGGA 
CCTGCGAGCG; reverse primer sequence GCGGCTGTCTCCACAAGT; 
probe sequence CTGACCTGAAGGCTCT; probe fluorophore VIC. 
We calculated the percentage of molecules expected to be linked from 
each reaction as previously described”. 


Sperm cell library generation 
We generated accessible sperm nuclei ‘florets’ using a combination 
of published decondensation protocols with some modifications. 


Sperm aliquots containing more than 200,000 cells were thawed on 
ice and then washed by spinning for 10 min at 400g at 4 °C. The pellet
was resuspended in 10 µl phosphate-buffered saline (PBS,
Gibco/LifeTechnologies) and recentrifuged under the same conditions. The
sperm pellet was resuspended in 2.5 µl of a sucrose buffer containing
250 mM sucrose (Sigma), 5 mM MgCl2 (Sigma) and 10 mM Tris HCl
(pH 7.5, Thermo Scientific). Sperm aliquots were submerged in liquid
nitrogen and immediately quick-thawed by holding them in a warm 
fist; three such freeze-thaw cycles were performed. 

Freeze-thawed sperm solution was combined with 22.5 µl decondensation
buffer (113 mM KCl (Sigma), 12.5 mM KH2PO4 (Sigma), 2.5 mM
Na2HPO4 (Sigma), 2.5 mM MgCl2 (Sigma) and 20 mM Tris (Thermo
Scientific) freshly supplemented with 150 µM heparin (sodium salt from
porcine, Sigma catalogue number H3393) and 2 mM β-mercaptoethanol
(Sigma)). The reaction was incubated at 37 °C for 45 min. To allow enzymatic
DNA amplification, heparin was inactivated by mixing the sperm
solution with 0.5 U heparinase I (Sigma H2519) by gently pipetting and
incubating at room temperature for 2 h (ref. *).

The sperm solution was moved to ice, and sperm floret concentra- 
tion was determined by diluting 1:100 with PBS and staining with 1x 
SYBR I (Thermo Scientific), then counting using the green fluorescence
channel at 10x magnification. 

Droplets were prepared using the following modifications to 10x 
Genomics’ GemCode (version 1; ref. ’) user guide revision C (in place 
of steps 5.1-5.3.9); all reagents come from the 10x Genomics GemCode 
kit. Ultrapure water was combined with 10,833 sperm to a final volume
of 5 µl; 10,000 sperm were used for library generation. To each sperm
sample was added 60 µl of a master mix containing 32.5 µl GemCode
reagent mix, 1.5 µl primer release agent, 9.2 µl GemCode polymerase
and 16.8 µl ultrapure water.

GemCode beads were vortexed at full speed for 25 s, and then diluted
1:11 with ultrapure water to a total volume of at least 90 µl per sample.
Per 10x Genomics’ GemCode protocol, 60 µl of the sample-master
mix combination was added to the droplet generation chip, followed
by 85 µl of freshly pipette-mixed 1:11-diluted bead mixture and 150 µl
of droplet generation oil.

Droplets were generated and processed by library generation fol- 
lowing 10x Genomics’ GemCode (version 1) user guide revision C (step 
5.3.10 through to the end of section 6). 


Sequencing and sequence data processing 

We generated two libraries per sperm donor and additional libraries 
for four initial samples with low cell counts. We sequenced four or five 
libraries at a time on S2 200 cycle flow cells on an Illumina NovaSeq. 
The read structure was 178 cycles for read 1, 8 cycles for read 2 (index 
read one), 14 cycles for read 3 (index read two containing the cell bar- 
code; later treated as the reverse read), and 5 cycles for read 4 (unused; 
included to fulfil the NovaSeq’s paired-end requirement). 

To convert the data to mapped binary alignment map (BAM) files 
with cell and molecular barcodes encoded as read tags, we used Pic- 
ard Tools v.2.2 (http://broadinstitute.github.io/picard) and Drop-seq 
Tools v.2.2 (https://github.com/broadinstitute/Drop-seq/releases; 
see https://github.com/broadinstitute/Drop-seq/blob/master/doc/ 
Drop-seq_Alignment_Cookbook.pdf for details on running many of 
the tools)”. 

Illumina binary base call (BCL) files were converted to unmapped
BAM files using Picard’s ExtractIlluminaBarcodes and
IlluminaBasecallsToSam with read structure 178T8B14T (cell barcodes, present
in the i5 index, were incorporated as read 2 for ease of downstream
processing). BAMs were processed to include unique molecular 
identifiers (UMIs) and cell barcodes as read tags, and to exclude 
reads with poor-quality cell barcodes or UMIs; consequently, each 
read was retained as single-end with a 14-base-pair (bp) cell barcode 
stored in tag XC and a 10-bp molecular barcode/UMI stored in tag
XM. The first 10 bp of read 1 were used as the UMI. First, DropSeq
Tools’ TagBamWithReadSequenceExtended was called with
BASE_RANGE = 1-14, BASE_QUALITY = 10, BARCODED_READ = 2,
DISCARD_READ = true, TAG_NAME = XC, NUM_BASES_BELOW_QUALITY = 1.
Subsequently, TagBamWithReadSequenceExtended was called again
with BASE_RANGE = 1-10, BASE_QUALITY = 10, HARD_CLIP_BASES = true,
BARCODED_READ = 1, DISCARD_READ = false, TAG_NAME = XM,
NUM_BASES_BELOW_QUALITY = 1. Finally, DropSeq Tools’ FilterBAM was
called with parameter TAG_REJECT = XQ.
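
The read-structure string used above can be unpacked with a few lines of code. This is a standalone sketch rather than Picard's own parser, and it assumes the commonly used operators T (template), B (barcode) and S (skip); the function name is hypothetical.

```python
import re

def parse_read_structure(structure):
    """Split a read structure such as '178T8B14T' into (length, operator) pairs."""
    parts = re.findall(r"(\d+)([TBS])", structure)
    # Reject strings with leftover characters the pattern did not consume.
    if "".join(length + op for length, op in parts) != structure:
        raise ValueError(f"unrecognized read structure: {structure!r}")
    return [(int(length), op) for length, op in parts]

segments = parse_read_structure("178T8B14T")
```

Here the 14-cycle read carrying the cell barcode is declared as a template segment (T) rather than a barcode segment (B), which is what allows the downstream tagging step to treat it as read 2.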

Reads were aligned to hg38 using bwa mem® v.0.7.7-r441. BAMs 
were converted to FastQ using Picard’s SamToFastQ, FastQ reads 
were aligned using bwa mem -M, and then unmapped BAMs were 
merged with mapped BAMs using Picard’s MergeBamAlignment, with 
non-default options INCLUDE_SECONDARY_ALIGNMENTS = false and
PAIRED_RUN = false. Reads were marked as PCR duplicates using Drop-seq
Tools’ SpermSeqMarkDuplicates (part of Drop-seq Tools v.2.2 and above)
with options STRATEGY = READ_POSITION, CELL_BARCODE_TAG = XC,
MOLECULAR_BARCODE_TAG = XM, NUM_BARCODES = 20000,
CREATE_INDEX = true. BAM files for all lanes and index sequences from
the same sample were merged using Picard’s MergeSamFiles before 
alignment and/or during duplicate marking with all BAMs given as 
input to SpermSeqMarkDuplicates. 


Variant calling and sperm cell genotyping 

For each donor, we pooled all reads from all libraries, including 
reads that did not derive from a barcode associated with a complete 
sperm cell. Using GATK v.3.7 (refs. *”°8) in hg38, we followed GATK’s 
best-practices documentation for base quality score recalibration; for 
genomic variable call format (gVCF) generation using HaplotypeCaller 
(in ‘discovery’ mode with -stand_call_conf 20); and for joint genotyping
with GenotypeGVCFs. We filtered variants with SelectVariants -selectType
SNP and VariantFiltration (-filterExpression ‘QD < 3.0’). We then
performed variant quality score recalibration (VQSR) following GATK’s
best practices, except that we excluded annotations MQ and DP
(VariantRecalibrator with GATK-provided resources; -an QD, MQRankSum,
ReadPosRankSum, FS and SOR; -mode SNP; -trustAllPolymorphic;
and tranches 90, 99.0, 99.5, 99.9 and 100.0). We applied tranche 99.9
recalibration using ApplyRecalibration -mode SNP and obtained the
names of SNPs from SNP database (dbSNP) build 146 (ref.°’) using
VariantAnnotator --dbsnp. We filtered our sites to contain only those biallelic
SNPs that were present in Hardy-Weinberg equilibrium in 1000
Genomes Phase 3 (ref. ©) using SelectVariants -concordance with a
VCF containing only these sites (from GATK’s resource bundle). We 
excluded SNPs in centromeric regions or acrocentric arms, as defined 
by the University of California Santa Cruz (UCSC)’s Genome Browser’s 
cytoband track” (http://genome.ucsc.edu; the same centromere 
boundaries were used in all analyses), and those in known paralogous 
regions”. We selected only heterozygous SNPs using SelectVariants 
-selectType SNP -selectTypeToExclude INDEL -restrictAllelesTo BIALLELIC
-excludeFiltered -setFilteredGtToNocall -selectexpressions
‘vc.getGenotype("<sample name>").isHet()’.

We identified the SNPs present in each sperm cell and the allele that
was present using GenotypeSperm (part of Drop-seq Tools v.2.2 and 
above). For downstream analyses, we generated a file with columns 
cell, pos and gt, with gt having the value 0 for the reference allele and
1 for the alternate allele for SNPs that had one or more UMIs covering 
only one base matching the reference or alternate allele (see our script 
gtypesperm2cellsbyrow.R). 


Chromosome-scale phasing

We identified barcodes that were potentially associated with cells by 
plotting the cumulative fraction of reads associated with each ranked 
barcode and identifying the inflection point of this curve (Extended 
Data Fig. 1f). We then included only those barcodes that had substantial 
read depth on either the X or the Y chromosome but not both, as the 
vast majority of sperm cells should contain only one sex chromosome. 


(We later added these barcodes back in before formally identifying and 
excluding cell doublets.) 

To phase sperm donors’ genomes, we used all quality-controlled 
heterozygous sites in these cell barcodes expected to correspond to 
sperm cells, excluding observations of SNPs for which the observed 
allele was not the reference or alternate allele in the parental genome, 
or for which more than one allele was observed. For each chromosome,
we converted per-cell SNP calls into ‘fragments’ for input into the Hap- 
CUT phasing software®*® by considering each consecutive pair of SNPs 
observed in a cell to be a fragment (see our script gtypesperm2fmf.R).
We then used HapCUT with parameter -maxiter 100 to generate chro- 
mosomal phase. After identifying and removing cell doublets (see 
below), we repeated phasing with only non-doublet cell barcodes. 
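
The conversion of per-cell SNP calls into two-SNP fragments can be sketched as below. Only the pairing step is shown; the function name and tuple layout are hypothetical, and the sketch does not attempt to reproduce the fragment file format that HapCUT reads.

```python
def consecutive_snp_fragments(calls):
    """Pair each observed SNP with the next observed SNP in the same cell.

    calls: iterable of (position, genotype) for one cell on one chromosome,
    with genotype 0 (reference allele) or 1 (alternate allele).
    Returns a list of ((pos1, gt1), (pos2, gt2)) fragments.
    """
    ordered = sorted(calls)
    return [(ordered[i], ordered[i + 1]) for i in range(len(ordered) - 1)]

# Hypothetical cell with three observed heterozygous sites.
fragments = consecutive_snp_fragments([(500, 1), (100, 0), (900, 1)])
```

Because each sperm cell is haploid, every fragment links two alleles that sat on the same parental chromosome unless a crossover or an error falls between them, which is what makes the pooled fragments informative for phasing.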

To validate our phasing method, we simulated single-cell SNP 
observations from known haplotypes, including 2% genotype errors 
and a variable percentage of cell doublets. In brief, sites were ran- 
domly sampled from one known haplotype of chromosome 17 until 
a crossover location was probabilistically assigned on the basis of 
the deCODE recombination map*, then sampled from the other hap- 
lotype (one crossover was simulated per cell). To simulate PCR or 
sequencing errors, 2% of the sites were randomly assigned to an allele.
Doublets were simulated by combining two cells and retaining 70% 
of the observed sites at random. We performed five random simula- 
tions for each doublet proportion, for the mean proportion of sites 
‘observed’ in each cell, and for the number of cells simulated, and then 
followed our phasing protocol using each simulation (see our script 
simulatespermseqfromhaps.py). 
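
A stripped-down version of this simulation for a single cell might look as follows. It is not the authors' simulatespermseqfromhaps.py: the crossover is placed uniformly at random instead of following the deCODE map, doublets are not simulated, and all names and default values are assumptions of this sketch.

```python
import random

def simulate_cell(hap_a, hap_b, observe_frac=0.1, error_rate=0.02, rng=None):
    """Simulate sparse SNP observations from one gamete with one crossover.

    hap_a, hap_b: equal-length lists (at least two sites) of 0/1 alleles for
    the two known haplotypes. Returns (crossover_index, [(site, allele)]).
    """
    rng = rng or random.Random()
    n_sites = len(hap_a)
    crossover = rng.randrange(1, n_sites)  # uniform placement (assumption)
    calls = []
    for i in range(n_sites):
        if rng.random() >= observe_frac:
            continue  # site not covered in this cell
        allele = hap_a[i] if i < crossover else hap_b[i]
        if rng.random() < error_rate:
            allele = rng.choice((0, 1))  # error: assign a random allele
        calls.append((i, allele))
    return crossover, calls
```

Passing a seeded `random.Random` instance makes a run reproducible, which is convenient when comparing phasing accuracy across parameter settings.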

To further validate phasing, we used Sperm-seq data to phase one 
donor’s genome and compared these phased haplotypes to this donor’s 
Eagle®°*’-generated haplotypes. We compared the phase relationship 
between each consecutive pair of SNPs (identifying the proportion 
of switch errors between the two phased sets). We also compared the 
Sperm-seq allele-allele phase of all pairs of alleles in perfect linkage 
disequilibrium in 1000 Genomes Phase 3 (ref. °°) in the populations 
matching the donor’s ancestry. 
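
The switch-error comparison between two phased sets can be sketched as below. This is an illustrative Python function under stated assumptions: both phasings cover the same ordered SNPs, and each entry records which haplotype (0 or 1) carries the reference allele; the function name is hypothetical.

```python
from typing import List

def switch_error_rate(phase_a: List[int], phase_b: List[int]) -> float:
    """Proportion of consecutive SNP pairs whose phase relationship differs."""
    if len(phase_a) != len(phase_b) or len(phase_a) < 2:
        raise ValueError("need two phasings over the same SNPs (at least two)")
    switches = 0
    for i in range(1, len(phase_a)):
        same_in_a = phase_a[i] == phase_a[i - 1]
        same_in_b = phase_b[i] == phase_b[i - 1]
        if same_in_a != same_in_b:
            switches += 1  # the two sets disagree about this consecutive pair
    return switches / (len(phase_a) - 1)

# Swapping the haplotype labels everywhere is not an error.
rate_flipped = switch_error_rate([0, 0, 0, 0], [1, 1, 1, 1])
# One change of relative phase among four consecutive pairs.
rate_one_switch = switch_error_rate([0, 0, 0, 0, 0], [0, 0, 1, 1, 1])
```

Comparing consecutive pairs rather than absolute labels makes the measure insensitive to which haplotype is arbitrarily named first.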


Cell doublets 

To identify cell barcodes associated with more than one sperm cell 
(cell doublets), we detected consecutively observed SNP alleles that 
appeared on different parental haplotypes, which could occur because 
of crossover, error, or the presence of two haplotypes in the same droplet 
(doublet). We ranked barcodes by the proportion of consecutive 
SNPs that spanned haplotypes by using all SNPs from all autosomes 
except the autosome with the most haplotype-spanning consecutive 
SNPs (so as to avoid mistakenly identifying cells with chromosome 
gains as doublets); this resulted in a clear inflection point wherein cell 
doublets had a quickly accelerating proportion of haplotype-spanning 
consecutive SNPs (Extended Data Fig. 2d-f). All cell barcodes below 
this inflection point (identified with the function ‘ede’ from the R package 
‘inflection’ https://CRAN.R-project.org/package=inflection) were 
considered non-doublet (Extended Data Fig. 2f) (see our script 
computeSwitchesandInflThresh.R). Even though we specifically exclude 
the autosome with the most haplotype-spanning consecutive SNPs 
from doublet identification, any cells with multiple chromosome 
gains (especially more than two) or whole-genome diploidy would be 
excluded by this method. 
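
The per-barcode statistic used for ranking can be sketched as follows. This is an illustrative Python version, not our script computeSwitchesandInflThresh.R; it assumes that each observed SNP has already been assigned a parental haplotype label, excludes the autosome with the largest count of haplotype-spanning pairs, and leaves out the inflection-point step. All names are hypothetical.

```python
from typing import Dict, List

def spanning_proportion(haps_by_chrom: Dict[str, List[int]]) -> float:
    """Proportion of consecutive SNP pairs that span haplotypes for one barcode.

    haps_by_chrom maps each autosome to the haplotype labels (1 or 2) of the
    observed SNPs in positional order.
    """
    per_chrom = {}
    for chrom, labels in haps_by_chrom.items():
        spanning = sum(a != b for a, b in zip(labels, labels[1:]))
        per_chrom[chrom] = (spanning, max(len(labels) - 1, 0))
    # Drop the autosome with the most haplotype-spanning pairs, so that a cell
    # with a single chromosome gain is not mistaken for a doublet.
    worst = max(per_chrom, key=lambda c: per_chrom[c][0])
    spanning = sum(s for c, (s, _) in per_chrom.items() if c != worst)
    pairs = sum(n for c, (_, n) in per_chrom.items() if c != worst)
    return spanning / pairs if pairs else 0.0

# Hypothetical barcode: chr2 looks like a gained chromosome and is ignored.
toy = {"chr1": [1, 1, 1, 1, 2], "chr2": [1, 2, 1, 2, 1], "chr3": [2, 2, 2, 2, 2]}
prop = spanning_proportion(toy)
```

Barcodes would then be ranked by this proportion and the threshold placed at the inflection point of the ranked curve.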


Crossover events 

We identified crossover events on all autosomes (but excluded 
the p arms of acrocentric chromosomes for which SNPs were 
excluded from analysis) by finding transitions between tracts of 
SNPs with alleles that match different parental haplotypes using 
a hidden Markov model written in R with package ‘HMM’ 
(https://CRAN.R-project.org/package=HMM). To ensure that we detected 
crossovers located near the ends of SNP coverage (subtelomeric 
regions are frequently used for crossovers in spermatogenesis), 
we ran the HMM in both the forward-chromosomal and the 
reverse-chromosomal directions, with the start probability for one 
haplotype equal to 1 if the first two SNPs observed were of that haplotype. 
In addition to two states for parental haplotypes, we included 
a third ‘error’ state to capture cases in which a haplotype 1 allele is 
observed in a haplotype 2 region (and vice versa), for example, owing 
to PCR or sequencing error, gene conversion, or cases in which a 
small piece of off-haplotype ambient DNA was captured in a droplet. 
Crossovers were where one haplotype transitioned to another, 
or where one haplotype transitioned to the error state and then to 
the other haplotype. Crossover boundaries were the last SNP in the 
first haplotype and the first in the next. The key parameters for this 
algorithm are the transition probability between haplotypes (set to 
0.001, from the per-cell median 26 crossovers divided by the per-cell 
median 24,710 heterozygous SNPs) and transition probability into 
and out of the ‘error’ state (we set the probability of transition into 
this state to 0.03 from either haplotype, as only a few percent of 
SNPs are off-haplotype; we set the probability of staying in error to 
0.9 to allow for the occasional tract of SNPs from an ambient piece 
of off-haplotype DNA). Emission probabilities were 100% haplotype 
1 alleles from haplotype 1, 100% haplotype 2 alleles from haplotype 
2, and equal probability haplotype 1 or 2 alleles from the third ‘error’ 
state. Crossover calling was robust to a range of low transition probabilities 
(see our script spseqHMMCOCaller_3state.R, which calls 
crossovers on one chromosome). 
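
The three-state model can be illustrated with a small Viterbi decoder. This Python sketch is not our R implementation: it decodes in one direction only, uses equal start probabilities for the two haplotype states, and splits the probability of leaving the ‘error’ state equally between the haplotypes (0.05 each); these choices and all names are assumptions made for the example. The haplotype-to-haplotype transition of 0.001 corresponds to roughly 26/24,710.

```python
import math
from typing import Dict, List, Tuple

STATES = ("hap1", "hap2", "error")
NEG_INF = float("-inf")

def _log(p: float) -> float:
    return math.log(p) if p > 0 else NEG_INF

TRANS: Dict[str, Dict[str, float]] = {
    "hap1": {"hap1": 0.969, "hap2": 0.001, "error": 0.03},
    "hap2": {"hap1": 0.001, "hap2": 0.969, "error": 0.03},
    "error": {"hap1": 0.05, "hap2": 0.05, "error": 0.9},  # exit split is an assumption
}
EMIT = {"hap1": {1: 1.0, 2: 0.0}, "hap2": {1: 0.0, 2: 1.0}, "error": {1: 0.5, 2: 0.5}}
START = {"hap1": 0.5, "hap2": 0.5, "error": 0.0}  # simplified start (assumption)

def viterbi(obs: List[int]) -> List[str]:
    """Most likely state path for a sequence of observed haplotype labels (1 or 2)."""
    score = {s: _log(START[s]) + _log(EMIT[s][obs[0]]) for s in STATES}
    back = []
    for o in obs[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + _log(TRANS[p][s]))
            new_score[s] = score[prev] + _log(TRANS[prev][s]) + _log(EMIT[s][o])
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

def crossover_intervals(path: List[str]) -> List[Tuple[int, int]]:
    """(last SNP index in one haplotype, first SNP index in the next haplotype)."""
    events, last_hap, last_idx = [], None, None
    for i, s in enumerate(path):
        if s == "error":
            continue
        if last_hap is not None and s != last_hap:
            events.append((last_idx, i))
        last_hap, last_idx = s, i
    return events

# One isolated off-haplotype SNP (index 10), then a true switch after index 20.
obs = [1] * 10 + [2] + [1] * 10 + [2] * 10
path = viterbi(obs)
events = crossover_intervals(path)
```

In this toy sequence the isolated off-haplotype observation is absorbed by the ‘error’ state, and only the sustained change of haplotype is reported as a crossover.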

After aneuploidy identification, we marked aneuploid chromosomes 
as having no crossovers for all crossover analyses (absent chromosomes 
have no crossovers and crossovers are called differently on gained 
chromosomes, described below). 


Identifying even-coverage cell barcodes 

We used Genome STRiP v2.0 (http://software.broadinstitute.org/software/genomestrip/) 
to determine sequence read depth (observed 
number of reads divided by expected number of reads) in bins of 100 kb 
of uniquely mappable sequence across the genome in each sperm cell, 
using Genome STRiP’s default guanine-cytosine (GC) bias correction 
and repetitive region masking for reference genome hg38. We divided 
the read depth by two to obtain the read depth per haploid rather than 
diploid genome. Input to Genome STRiP was a BAM file containing only 
cells of interest, with read groups set to <sample name>:<cell barcode> 
(created using Drop-seq Tools’ ConvertTagToReadGroup with options 
CELL_BARCODE_TAG=XC, SAMPLE_NAME=<name of sample/donor>, 
CREATE_INDEX=true, and CELL_BC_FILE=list of barcodes potentially 
associated with cells, described above). 

A minority of cell barcodes were associated with eccentric read 
depth across many chromosomes, with wave-like read depth vacillating 
between 0 and 2 or more. (We hypothesize that these cell barcodes 
were associated with sperm nuclei that did not properly decondense, 
such that some regions of the genome were more accessible than others, 
leading to undulating read depths across more- and less-accessible 
chromatin.) To identify and exclude such barcodes, we treated read 
depths across each chromosome as a time series and used Box-Jenkins 
autoregressive integrated moving average (ARIMA) modelling to model 
how read depth observations relied on their previous values and their 
overall averages (implemented via the R package ‘forecast’, excluding 
differencing). By visual inspection, we determined that chromosomes 
with certain ARIMA criteria were likely to have an undulating 
read depth, and that cell barcodes with five or more such identified 
chromosomes were likely to have eccentric read depths globally. We 
flagged individual chromosomes if: (1) the sum of the AR1 and AR2 coefficients 
was greater than 0.7, the AR1 coefficient was greater than 0.9, 
or the net sum of all AR and MA coefficients was greater than 1.25; and 
(2) either the net sum of AR and MA coefficients was greater than 0.4 
or the intercept was less than 0.8 or greater than 1.2. If both criteria in 
(2) were met, this signified an exceedingly odd chromosome, which we 
counted twice. Cell barcodes with five or more chromosomes flagged 
in this way were excluded from downstream analyses. (Because gains of 
large amounts of the genome cause artificially depressed read depths 
on nongained chromosomes, we manually examined any cells with a 
large range of ARIMA intercepts and more than five chromosomes 
denoted as unstable. Any such cells that had simply gained a large proportion 
of the genome, for example three copies of chromosome 2, 
were included rather than excluded). We cross-referenced all cell 
exclusions with called aneuploidies, confirming that cells were not 
excluded simply on the basis of having lost or gained a chromosome 
(see our scripts setupgsreaddepth.R, exclbadreaddepth_arima_1.R, 
exclbadreaddepth_initid_2.R, and exclbadreaddepth_finalize_3.R). 
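
The flagging rule applied to fitted ARIMA coefficients can be written out explicitly. The sketch below, in Python, takes coefficients that have already been estimated (the model fitting itself, done with the R package ‘forecast’, is not reproduced); the function names and the input layout are assumptions for illustration.

```python
from typing import List, Sequence, Tuple

def chromosome_flag_weight(ar: Sequence[float], ma: Sequence[float], intercept: float) -> int:
    """Return 0 (not flagged), 1 (flagged) or 2 (flagged and counted twice)."""
    ar1 = ar[0] if len(ar) > 0 else 0.0
    ar2 = ar[1] if len(ar) > 1 else 0.0
    net = sum(ar) + sum(ma)  # net sum of all AR and MA coefficients
    criterion_1 = (ar1 + ar2 > 0.7) or (ar1 > 0.9) or (net > 1.25)
    sum_odd = net > 0.4
    intercept_odd = intercept < 0.8 or intercept > 1.2
    if not (criterion_1 and (sum_odd or intercept_odd)):
        return 0
    # Both parts of criterion (2) met: an exceedingly odd chromosome, counted twice.
    return 2 if (sum_odd and intercept_odd) else 1

def exclude_cell(fits: List[Tuple[Sequence[float], Sequence[float], float]]) -> bool:
    """Exclude a barcode when its flagged-chromosome count reaches five."""
    return sum(chromosome_flag_weight(ar, ma, c) for ar, ma, c in fits) >= 5

w_single = chromosome_flag_weight([0.5, 0.3], [0.0], 1.0)   # autocorrelated, normal intercept
w_double = chromosome_flag_weight([0.5, 0.3], [0.0], 0.5)   # autocorrelated, low intercept
w_none = chromosome_flag_weight([0.1], [0.1], 1.0)          # unremarkable chromosome
```

The manual review of cells with large chromosome gains described above is a separate step and is not captured by this rule.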


Replicate barcodes (‘bead doublets’) 

One sperm cell can be encapsulated in a droplet with more than one 
barcoded bead. To identify such cases, where pairs of sperm genomes 
were identical, we determined the proportion of SNPs that were of 
the same haplotype for each pair of barcodes. We imputed the haplotype 
of all heterozygous SNPs on the basis of the haplotype of surrounding 
observed SNPs and locations of recombination events, and 
compared SNP haplotypes across sperm cell pairs. SNP observations 
between boundaries of crossovers were excluded from analysis. 
Sperm cells shared on average 50% of their genomes, but a few sets 
of barcodes shared nearly 100% of their SNP haplotypes (Extended 
Data Fig. 3a). We considered these pairs to be ‘bead doublets’ or replicate 
barcodes. In all downstream analyses, we used only one barcode 
(chosen randomly) from a set corresponding to the same cell (see our 
scripts imputeHaplotypeAllSNPs.R, compareSpermHapsPropSNPs.R, 
combineChrsSpermHapsPropSNPs.R, and curateNonRepBCList.R). 
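
The pairwise comparison of imputed haplotypes can be sketched as follows. This Python example assumes that imputation has already been done, marks SNPs excluded from the comparison (for example, those between crossover boundaries) with None, and uses a hypothetical sharing threshold of 0.99 to stand in for ‘nearly 100%’; it is not a reproduction of our R scripts.

```python
from itertools import combinations
from typing import Dict, List, Optional, Tuple

def shared_fraction(hap_a: List[Optional[int]], hap_b: List[Optional[int]]) -> float:
    """Fraction of usable SNPs at which two barcodes carry the same haplotype."""
    usable = [(a, b) for a, b in zip(hap_a, hap_b) if a is not None and b is not None]
    if not usable:
        return float("nan")
    return sum(a == b for a, b in usable) / len(usable)

def candidate_bead_doublets(cells: Dict[str, List[Optional[int]]],
                            threshold: float = 0.99) -> List[Tuple[str, str]]:
    """Barcode pairs whose imputed haplotypes are (nearly) identical."""
    return [(x, y) for x, y in combinations(sorted(cells), 2)
            if shared_fraction(cells[x], cells[y]) >= threshold]

# Hypothetical imputed haplotypes (1 or 2) at five SNPs for three barcodes.
toy_cells = {
    "bc1": [1, 1, 2, 2, None],
    "bc2": [1, 1, 2, 2, 1],
    "bc3": [2, 1, 1, 2, 1],
}
pairs = candidate_bead_doublets(toy_cells)
```

Unrelated sperm cells are expected to share about half of their haplotypes, so pairs close to complete sharing stand out clearly.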


Crossover zones 

To define regions of recombination use, we found local minima of the 
density (built-in function in R) of all crossovers’ median positions across 
all samples on each chromosome. Minima were identified using the 
findPeaks function (from https://github.com/stas-g/findPeaks) on 
the inverse density with m = 3. Crossover zones run from the beginning 
of the chromosome (including the whole p arm for acrocentric 
chromosomes) to the location of the first local minimum, from the 
location of the first local minimum plus one base pair to the next local 
minimum, and so on, with the last zone on each chromosome ending 
at the chromosome end (see our script findcozones_peaks.R). 
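
The step from density minima to zone boundaries can be sketched as below. This Python example assumes that a crossover density has already been evaluated on a grid; the simple local-minimum test is a stand-in for, not a reimplementation of, the findPeaks function, and all names and values are hypothetical.

```python
from typing import List, Tuple

def local_minima(density: List[float], m: int = 3) -> List[int]:
    """Grid indices that are strictly lower than the m points on either side."""
    minima = []
    for i in range(m, len(density) - m):
        window = density[i - m:i] + density[i + 1:i + m + 1]
        if all(density[i] < v for v in window):
            minima.append(i)
    return minima

def zones_from_minima(minima_bp: List[int], chrom_length: int) -> List[Tuple[int, int]]:
    """Zones run from the chromosome start to the first minimum, then from each
    minimum plus one base pair to the next, ending at the chromosome end."""
    bounds = sorted(minima_bp)
    starts = [1] + [b + 1 for b in bounds]
    ends = bounds + [chrom_length]
    return list(zip(starts, ends))

toy_density = [5, 4, 3, 2, 3, 4, 5, 6, 5, 4, 1, 4, 5, 6]
minima_idx = local_minima(toy_density, m=3)
zones = zones_from_minima([100, 250], chrom_length=400)
```

In practice the grid indices of the minima would be converted to base-pair positions before the zones are built.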


Aneuploidy and chromosome arm loss/gain 

As described previously (see Methods section ‘Identifying 
even-coverage cell barcodes’), we used Genome STRiP 
(http://software.broadinstitute.org/software/genomestrip/) to determine read 
depth in each sperm cell in 100-kb bins. We located chromosomes or 
chromosome arms with aberrant read depth to identify aneuploidy. 

We excluded genomic regions that had outlying read depths across 
all cells, defined as those with P < 0.05 in a one-sided one-sample t-test 
(looking for increased read depth) against the expected mean read 
depth of 2# (defined below). To identify gains of autosomes, we performed 
a one-sided one-sample t-test (expecting increased read depth 
in a gain) for each cell against the expected read depth for a gain of 
one copy, 2#. For each cell, this analysis compared the distribution of 
all bins’ read depth across a region of interest to the gain expectation 
2#, and flagged any cells whose read depth distributions were not significantly 
different (P > 0.05). We used the same approach to identify 
losses, comparing a cell’s read depth distribution across bins to 0.1 and 
flagging any that were not significantly higher (P > 0.05). 

The expected copy number for gains is 2, but the expected read 
depth for gains depends on the size of the chromosome: a library 
corresponding to a cell with a chromosome gain has more reads than 
would be in that same library without a gain. This phenomenon pulls the 
read depth down globally by increasing the total number of expected 
reads, causing the denominator in each read depth bin (the expected 
number of reads in that bin) to increase. Therefore, we computed a 
chromosome-specific critical read depth value for identifying gains: 
2# = 2*(the proportion of the genome in base pairs coming from all 
chromosomes other than the tested one). For losses, we used 0.1 rather 
than 0 as the expected read depth, because a small number of reads 
generally align to every chromosome in every library. 
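
The critical read depth 2# is simple arithmetic and can be made concrete with a short example. The Python function below follows the definition given above; the function name and the toy chromosome lengths are hypothetical and chosen only to make the calculation easy to check.

```python
from typing import Dict

def gain_critical_depth(chrom_lengths_bp: Dict[str, int], tested: str) -> float:
    """2# = 2 * (proportion of the genome, in bp, from all chromosomes but the tested one)."""
    total = sum(chrom_lengths_bp.values())
    return 2 * (total - chrom_lengths_bp[tested]) / total

# Hypothetical three-chromosome genome: the tested chromosome is 10% of the total.
toy_genome = {"chrA": 100, "chrB": 300, "chrC": 600}
critical_small = gain_critical_depth(toy_genome, "chrA")   # 2 * 0.9
critical_large = gain_critical_depth(toy_genome, "chrC")   # 2 * 0.4
```

The larger the gained chromosome, the more it inflates the expected read count and the further the expected read depth for a gain falls below 2.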

For nonacrocentric chromosomes, we performed aneuploidy calling 
for the arms separately and for the whole chromosome. Because 
amplification of more than two copies of a chromosome arm could 
result in the whole chromosome passing the P-value threshold, we 
required a whole-chromosome event to pass the P-value threshold at 
the whole-chromosome level and to have a rounded read depth for 
both arms of 2 or more for a gain (or 0 for a loss). For the acrocentric 
chromosomes, only the q arm was considered, and any q arm gain or 
loss was considered to be a whole-chromosome event (unless investigated 
further). 

For the sex chromosomes, we followed a similar statistical framework, 
but a loss was considered an aneuploidy only if both the X and the Y 
chromosomes were flagged as lost. A gain was called if both X and Y chromosomes 
were present (see our scripts setupgsreaddepth.R, idaneus_initialttests.R, 
curateaneudata_clean.R, getautosomalaneumatrix.R 
and getxykaryos_aneus.R for aneuploidy calling and output 
formatting; see our scripts curateAnFreqFromCodeMatrix.R, 
curateInitAnalyzeXxYKaryos.R and combineAnFreq_AutXY.R for conversion 
of outputs of aneuploidy calling to cross-donor aneuploidy 
frequency tables). 


Division of origin for chromosome gains 
To see when chromosome gains originated, we determined whether 
the centromeres of the multiple copies of the chromosomes were heterozygous 
and therefore from homologues, which typically disjoin in 
meiosis I, or homozygous and therefore from sister chromatids, which 
typically disjoin in meiosis II. We identified heterozygous regions for 
all cells using a hidden Markov model (HMM) in which the states are: (1) 
heterozygous (emitting either haplotype’s alleles), or (2) homozygous 
(emitting only one haplotype’s alleles), with the transition probability 
between the states equal to the recombination transition probability. 
For each gain, we determined whether heterozygous tracts overlapped 
the centromere. If a heterozygous tract started before the start of the 
centromere and ended after the end of the centromere, or started at 
the first SNP observed on an acrocentric chromosome or within the 
first ten SNPs and was more than ten SNPs long, then the chromosome was 
classified as a meiosis I gain; if no heterozygous tract overlapped the 
centromere, it was classified as a meiosis II gain (see our scripts 
getDiploidTracts_hmm.R, originOfGainID.R and curateOriginMultSamps.R). 
At the sex chromosomes, any XY sex chromosome gain derives from 
meiosis I (X and Y are homologues), whereas an XX or YY gain derives 
from meiosis II (sister chromatids are duplicated). 
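
The centromere-overlap rule for autosomal gains can be sketched in a few lines. This Python example covers only the case in which a heterozygous tract spans the whole centromere; the additional rule for acrocentric chromosomes is omitted, and the function name and coordinates are hypothetical.

```python
from typing import List, Tuple

def classify_gain_origin(het_tracts: List[Tuple[int, int]],
                         cen_start: int, cen_end: int) -> str:
    """Classify a gained chromosome from heterozygous tracts (start, end in bp)."""
    for start, end in het_tracts:
        if start < cen_start and end > cen_end:
            # Heterozygous across the centromere: both homologues are present.
            return "meiosis I"
    # No heterozygous tract spans the centromere: sister chromatids are present.
    return "meiosis II"

# Hypothetical centromere at 58-59 Mb.
origin_spanning = classify_gain_origin([(1_000_000, 60_000_000)], 58_000_000, 59_000_000)
origin_distal = classify_gain_origin([(70_000_000, 90_000_000)], 58_000_000, 59_000_000)
```

Heterozygous tracts away from the centromere do not change the call, because crossovers can make distal regions heterozygous regardless of the division in which the error occurred.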


Reporting summary 
Further information on research design is available in the Nature 
Research Reporting Summary linked to this paper. 


Data availability 


Crossover and aneuploidy data (individual events and counts per donor 
and/or per cell), including the source data underlying Figs. 2, 3b-e 
and Extended Data Figs. 5-9, are available via Zenodo at 
https://doi.org/10.5281/zenodo.2581570. Raw sequence data are available in the 
Sequence Read Archive (SRA) (https://www.ncbi.nlm.nih.gov/sra) via 
the Database of Genotypes and Phenotypes (dbGaP) 
(https://www.ncbi.nlm.nih.gov/gap/) for general research use upon application and 
approval (study accession number phs001887.v1.p1). 


Code availability 


Analysis scripts and documentation are available via Zenodo at 
https://doi.org/10.5281/zenodo.2581595. 
Extended Data Fig. 1 | Characterization of egg-mimic sperm preparation 
and optimization of bead-based single-sperm sequencing. a-c, 
Two-channel fluorescence plots showing the results of ddPCR with the input 
template noted above each panel, demonstrating that two loci (from different 
chromosomes) are detectable in the same droplet far more often when sperm 
DNA florets (rather than purified DNA) are used as input. Each point represents 
one droplet. Grey points (bottom left) represent droplets in which neither 
template molecule was detected; blue points (top left) represent droplets in 
which the assay detected a template molecule for the locus on chromosome 7; 
green droplets (bottom right) represent droplets in which the assay detected a 
template molecule for the locus on chromosome 10; and brown points (top 
right) represent droplets in which both loci were detected. With a high 
concentration of purified DNA as input (a), comparatively fewer droplets 
contain both loci than when untreated (b) or treated (c) sperm were used as 
input. Sperm ‘florets’ treated with the egg-mimicking decondensation 
protocol had a much higher fraction of droplets containing both loci than did 
purified DNA (compare a with c, left) and had more-sensitive ascertainment 
and cleaner results (quadrant separation) than untreated sperm (compare b 
with c, right). The pink lines in b delineate the boundaries between droplets 
categorized as negative or positive for each assay. d, Optimization of sperm 
preparation: characterization of the effect of different lengths of 37 °C 
incubation of sperm cells treated with egg-mimicking decondensation 
reagents on how often the loci on chromosomes 7 and 10 were detected in the 
same ddPCR droplet. The y-axis shows the percentage of molecules that are 
calculated to be linked to each other (that is, physically linked in input) for 
assays targeting chromosomes 7 and 10. Extracted DNA (‘DNA’, a negative 
control) gives the expected result of random assortment of the two template 
molecules into droplets. The 45-min heat treatment was used for all 
subsequent experiments in this study. e, f, Distribution of sequence reads 
across cell barcodes from droplet-based single-sperm sequencing. Each panel 
shows the cumulative fraction of all reads from a sequencing run coming from 
each read-number-ranked cell barcode; a sharp inflection point delineates the 
barcodes with many reads from those with few reads. Points to the left of the 
inflection point are the cell barcodes that are associated with many reads (that 
is, beads that are coencapsulated with cells); the height of the inflection point 
reflects the proportion of the sequence reads that come from these barcodes. 
Only reads that mapped to the human genome (hg38) and were not PCR 
duplicates are included. e, Data from an initial adaptation of 10x Genomics’ 
GemCode linked reads system, where a small proportion of the reads come 
from cell barcodes associated with putative cells. f, Data from the final, 
implemented adaptation of 10x Genomics’ GemCode linked reads system for 
the same number of input sperm nuclei as in e. The x-axis in f includes five times 
fewer barcodes than in e. 
Extended Data Fig. 2 | Evaluation of chromosomal phasing and 
identification of cell doublets. a, Phasing strategy. Green and purple denote 
the chromosomal phase of each allele (unknown before analysis). Each sperm 
cell carries one parental haplotype (green or purple), except where a 
recombination event separates consecutively observed SNPs (red X in bottom 
sperm). Because alleles from the same haplotype will tend to be observed in the 
same sperm cells, the haplotype arrangement of the alleles can be assembled at 
whole-chromosome scale (resulting in the phased donor genome). 

b, Evaluation of our phasing method using 1,000 simulated single-sperm 
genomes (generated from two a priori known parental haplotypes and 
sampled at various levels of coverage). Because cell doublets (which combine 
two haploid genomes and potentially two haplotypes at any region) can in 
principle undermine phasing inference, we included these doublets in the 
simulation (in proportions shown on the x-axis, which bracket the observed 
doublet rates). Each point shows the proportion of SNPs phased concordantly 
with the correct (a priori known) haplotypes (y-axis) for one simulation (five 
simulations were performed for each unique combination of proportion of cell 
doublets and percentage of sites observed). c, Relationship of phasing 
capability to number of cells analysed. Data are as in b, but for different 
numbers of simulated cells. All simulations had an among-cell mean of 1% of 
heterozygous sites observed. d, A cell doublet: when two cells (here, sperm 
DNA florets) are encapsulated together in the same droplet, their genomic 
sequences will be tagged with the same barcode; such events must be 
recognized computationally and excluded from downstream analyses. e, Four 
example chromosomes from a cell barcode associated with two sperm cells 
(a cell doublet). Black lines show haplotypes; blue circles are observations of 
alleles, shown on the haplotype from which they derive. Both parental 
haplotypes are present across regions of chromosomes for which the cells 
inherited different haplotypes. f, Computational recognition of cell doublets 
in Sperm-seq data (from an individual sperm donor, NC11). We used the 
proportion of consecutively observed SNP alleles derived from different 
parental haplotypes to identify cell doublets; this proportion is generally small 
(arising from sparse crossovers, PCR/sequencing errors, and/or ambient DNA) 
but is much higher when the analysed sequence comes from a mixture of two 
distinct haploid genomes. We use 21 of the 22 autosomes to calculate this 
proportion, excluding the autosome with the highest such proportion (given 
the possibility that a chromosome is aneuploid). The dashed grey line marks 
the inflection point beyond which sperm genomes are flagged as potential 
doublets and excluded from downstream analysis. Red points indicate 
barcodes with coverage of both X and Y chromosomes (potentially X + Y cell 
doublets or XY aneuploid cells); black points indicate barcodes with one sex 
chromosome detected (X or Y). The red (XY) cells below the doublet threshold 
are XY aneuploid but appear to have just one copy of each autosome. 
Extended Data Fig. 3 | Identification and use of ‘bead doublets’. a, SNP 
alleles were inferred genome-wide (for each sperm genome) by imputation 
from the subset of alleles detected in each cell and by Sperm-seq-inferred 
parental haplotypes. For each pair of sperm genomes (cell barcodes), we 
estimated the proportion of all SNPs at which they shared the same imputed 
allele. A small but surprising number of such pairwise comparisons (19 of 
984,906 from the donor shown, NC14) indicates essentially identical genomes 
(ascertained through different SNPs). b, We hypothesize that this arises from a 
heretofore undescribed scenario that we call ‘bead doublets’, in which two 
barcoded beads have coencapsulated with the same gamete and whose 
barcodes therefore tagged the same haploid genome. c, Random pairs of cell 
barcodes (here 100 pairs selected from donor NC10) tend to investigate few of 
the same SNPs (left), and also tend to detect the same parental haplotype on 
average at the expected 50% of the genome (right). d, ‘Bead doublet’ barcode 
pairs (here 20 pairs from donor NC10, who had the median number of bead 
doublets, left) also investigate few of the same SNPs, yet detect identical 
haplotypes throughout the genome (right). Results were consistent across 
donors. e, Use of ‘bead doublets’ to characterize the concordance of crossover 
inferences between distinct samplings of the same haploid genome by 
different barcodes. The bead doublets (barcode pairs) were compared to 100 
random barcode pairs per donor. Crossover inferences were classified as 
‘concordant’ (overlapping, detected in both barcodes), as ‘one SNP apart’ 
(separated by just one SNP, detected in both barcodes), as ‘near end of 
coverage’ (within 15 heterozygous SNPs of the end of SNP coverage at a 
telomere, where the power to infer crossovers is partial), or as discordant. Error 
bars (with small magnitude) show binomial 95% confidence intervals for the 
number of crossovers per category divided by number of crossovers total in 
both barcodes (32,714 crossovers total in 1,201 bead doublet pairs; 67,862 
crossovers total in 2,000 random barcode pairs; some barcodes are in multiple 
bead doublets or random barcode pairs). 
Extended Data Fig. 4 | Numbers and locations of crossovers called from 
downsampled data (equal number of SNPs in each cell, randomly chosen). 
To eliminate any potential effect of unequal sequence coverage across donors 
and cells, we used downsampling to create datasets with equal coverage 
(numbers) of heterozygous SNP observations in each cell. Crossovers were 
called from these random equally sized sets of SNPs from all cells. 

a, b, Crossover number per cell globally (a) and per chromosome (b) (785,476 
total autosomal crossovers called from downsampled SNPs included, 30,778 
cells included, aneuploid chromosomes excluded). c, Density plots of 
crossover location with crossover midpoints plotted and area scaled to be 
equal to the per-chromosome crossover rate. Grey rectangles mark 
centromeric regions; coordinates are in hg38. d, Similar numbers of crossovers 
were called from full data and equally downsampled SNP data: we performed 
correlation tests across cells for each donor and chromosome to compare the 
number of crossovers called from all data to the number of crossovers called 
from equal numbers of randomly downsampled SNPs. The histogram shows 
Pearson’s r values for all 460 (20 donors x 23 chromosomes (total number plus 
number for 22 autosomes)) tests (n per test = 974-2,274 cells per donor as in 
Extended Data Table 1; all chromosome comparisons Pearson’s r> 0.83; all 
two-sided P<10°). e, Crossovers called from equally downsampled SNP data 
were in similar locations to those called from all data: we performed correlation 
tests comparing crossover rates in 500-kb bins (centimorgans (cM) per 500 kb) 
from all data versus equally downsampled SNP data for each donor and 
chromosome. The histogram shows Pearson’s r values for all 460 (20 donors x 
23 chromosomes (genome-wide rate plus rate for 22 autosomes)) tests (n per 
test = number of 500-kb bins per chromosome (genome-wide: 5,739; 
chromosomes 1-22: 497, 484, 396, 380, 363, 341, 318, 290, 276, 267, 270, 266, 
228, 214, 203, 180, 166, 160, 117, 128, 93, 101); all chromosome comparisons 
Pearson’s r > 0.87, all two-sided P<10°°°). 
Extended Data Fig. 5 | Interindividual and intercell recombination rate 
from single-sperm sequencing. a, Density plot showing the per-cell number 
of autosomal crossovers for all 31,228 cells (813,122 total autosomal 
crossovers) from 20 sperm donors (per-donor cell and crossover numbers as in 
Extended Data Table 1; aneuploid chromosomes were excluded from crossover 
analysis). Colours represent a donor’s mean crossover rate (crossovers per cell) 
from low (blue) to high (red). This same mean recombination rate derived 
colour scheme is used for donors in all figures. The recombination rate differs 
among donors (n = 20; Kruskal-Wallis chi-squared = 3,665; df=19; P<10°°°). 
b, Per-chromosome crossover number in each of the 20 sperm donors (data as 
in a but shown for individual chromosomes). c, Per-chromosome genetic map 
lengths for: each of the 20 sperm donors, as inferred from Sperm-seq data 
(colours from blue to red reflect donors’ individual crossover rates as in a); a 
male average, as estimated from pedigrees by deCODE (yellow triangles); and 
a population average (including female meioses, which have more crossovers), 
as estimated from HapMap data (yellow circles). The deCODE genetic maps 
stop 2.5 Mb from the ends of SNP coverage. d, Physical versus genetic distances 
(for individualized sperm donor genetic maps and deCODE’s paternal genetic 
map) plotted at 500-kb intervals (in hg38 coordinates). Grey boxes denote 
centromeric regions (or centromeres and acrocentric arms). Sperm-seq maps 
are broadly concordant with deCODE maps (see the correlation test results 
in Supplementary Notes), except at subtelomeric regions that are not included 
in deCODE’s map. 
Extended Data Fig. 6 | Distributions of crossover locations along 
chromosomes (in ‘crossover zones’). a, Each donor’s crossover locations are 
plotted as a coloured line; the colour indicates the donor’s overall crossover 
rate (blue, low; red, high); grey boxes show the locations of centromeres (or, for 
acrocentric chromosomes, of centromeres and p arms). We used the midpoint 
between the SNPs bounding each inferred crossover as the position for each 
crossover in all analyses. To combine data across chromosomes, we show 
crossover locations (density plot) on ‘meta-chromosomes’ in which crossover 
locations are normalized to the length of the chromosome or arm on which 
they occurred. For acrocentric chromosomes, only the q arm was considered; 
for nonacrocentric chromosomes, the p and q arms were afforded space on the 
basis of the proportion of the nonacrocentric genome (in base pairs) they 
comprise, with the centromere placed at the summed p arms proportion of 
base pairs of these chromosomes. Crossover locations were first converted to 
the proportion of the arm at which they fall, and then these positions were 
normalized to the genome-wide p or q arm proportion. b, Identification of 
chromosomal zones of recombination use (‘crossover zones’) from all donors’ 
crossovers for 22 autosomes. Density plots are shown of crossover location for 
all sperm donors’ total 813,122 crossovers (aneuploid chromosomes excluded; 
the crossover location is the midpoint between SNPs bounding crossovers) 
along autosomes (hg38). Crossover zones (bounded by local minima of 
crossover density) are shown with alternating shades of grey. Diagonally 
hatched rectangles indicate centromeres (or centromeres and acrocentric 
arms). 
Extended Data Fig. 7 | Crossover placement in end zones, and crossover 
separation, varies in ways that correlate with crossover rate, among sperm 
donors and among individual gametes. Analyses are shown by donor (a-h; 
n = 20 sperm donors) or by individual gamete (i, j; n = 31,228 gametes). In a-h, 
the left panels show the phenotype distributions for individual donors, and the 
right panels show the relationship to the donors’ crossover rates. To control for 
the effect of the number of crossovers, the analyses in c, d and g-j use 
‘two-crossover chromosomes’: chromosomes on which exactly two crossovers 
occurred. For scatter plots (a-h, right), all x-axes show the mean crossover rate 
and all error bars are 95% confidence intervals (y-axes are described per panel). 
a, b, Left, both the proportion of crossovers that falls in the most distal 
chromosome crossover zones (a) and crossover separation (b; a readout of 
crossover interference, the distance between consecutive crossovers in Mb) 
vary among 20 sperm donors (proportion of crossovers in end per-cell 
distributions among-donor Kruskal-Wallis chi-squared = 2,334, df=19, 
P<10°°; all distances between consecutive crossovers among-donor Kruskal- 
Wallis chi-squared =3,309, df=19, P<10°°°). The right panels show both 
properties (y-axes, total proportion of crossovers in distal zones and median 
crossover separation, respectively) versus the donor’s crossover rate 
(correlation results for 20 sperm donors: proportion of all crossovers across 
cells in distal zones Pearson’s r=—0.95, two-sided P=2 x 10~°; Pearson’s 
r=-0.96, two-sided P=1x 10). c, Results obtained from an alternative method 
for calculating the proportion of crossovers in the distal regions of 
chromosomes. The proportion of crossovers in the distal 50% of chromosome 
arms varies across donors (left, among-donor Kruskal-Wallis chi-squared 
=2,209, df=19, P<10 °°) and negatively correlates with recombination rate 
(right, Pearson’s r=—0.92, two-sided P=2 x 10°; the y-axis shows the actual 
proportion of crossovers in the distal 50%). d, As in c, but with the proportion of 
crossovers from two-crossover chromosomes occurring in the distal 50% of 
chromosome arms. Left, among-donor Kruskal-Wallis chi-squared = 1,058, 


df=19, P=2x10"; right, correlation with recombination rate Pearson’s 
r=-0.93, two-sided P=4 x 10°. e, As in b but for consecutive crossovers onthe 
q arm of the chromosome. Left, among-donor Kruskal-Wallis chi-squared = 346,
df=19, P=7x10~; right, correlation with recombination rate Pearson’s 
r=-0.90, two-sided P=5 x 10°. f, As in b but for consecutive crossovers on
opposite chromosome arms (that is, crossovers that span the centromere). 
Left, among-donor Kruskal-Wallis chi-squared =1,554, df=19, P=1<10°”; 
right, correlation with recombination rate Pearson’s r=-—0.96, two-sided 
P=3x10". g, As in e but for distances between consecutive crossovers on two-
crossover chromosomes. Left, among-donor Kruskal-Wallis chi-squared = 181,
df=19, P=2x10*; right, correlation with recombination rate Pearson’s 
r=-0.88, two-sided P=3 x10”. h, As in f but for distances between consecutive
crossovers on two-crossover chromosomes. Left, among-donor Kruskal-Wallis
chi-squared = 930, df=19, P=5 x10; right, correlation with recombination 
rate Pearson’s r=—0.92, two-sided P=1x 10°. i,j, Boxplots show medians and 
interquartile ranges with whiskers extending to 1.5 times the interquartile 
range from the box. Each point represents a cell. i, Within-donor percentiles 
showing the proportion of crossovers from two-crossover chromosomes that 
fall in distal zones, plotted against the crossover-rate decile. Groups are deciles
of crossover rates normalized by converting each cell’s crossover count to a
percentile within-donor (all cells from all donors shown together; n cells in
deciles =3,152, 3,122, 3,276, 3,067, 3,080, 3,073, 3,135, 3,132, 3,090, 3,101, 
respectively (31,228 in total)). Because the initial data are proportions with 
small denominators, an integer effect is evident as pileups at certain values. 

j, Crossover interference from two-crossover chromosomes (showing the 
median consecutive crossover separation per cell). Each point represents the 
median of all percentile-expressed distances between crossovers from all two-
crossover chromosomes in one cell (percentile taken within-chromosome);
groupings and n values as in i.
Extended Data Fig. 8 | Crossover interference in individual sperm donors
and on chromosomes. a, Solid lines show density plots (scaled by donor’s
crossover rate) of the observed distance (separation) between consecutive
crossovers as measured in the proportion of the chromosome separating them
(left) and in genomic distance (right), with one line per donor (n=20). Dashed
lines show the distance between consecutive crossovers when crossover
locations are permuted randomly across cells to remove the effect of crossover
interference. b, The median of observed distances between consecutive
crossovers for one donor (NC18, who had the tenth lowest recombination rate
of 20 donors; blue dashed line) is shown along with a histogram of the medians
of n=10,000 among-cell crossover permutations (in both cases, the
permutation one-sided P-value is less than 0.0001). The units are the
proportion of the chromosome (left) and genomic distance (in Mb, right).
c, Crossover separation on example chromosomes; plots and n values are as in
b. Permutation one-sided P< 0.0001 for all chromosomes in all sperm donors
except occasionally for chromosome 21, where especially few double
crossovers occur. d, Median distances between donor NC18’s consecutive
crossovers for each autosome for all inter-crossover distances (left two panels)
and inter-crossover distances only from chromosomes with two crossovers
(right two panels). Units are proportion of the chromosome or genomic
distance. e, Diagram describing the analysis of crossover interference in
individualized genetic distance (one 20-cM window is shown), using a donor’s
own recombination map. f, When parameterized using each donor’s own
genetic map, sperm donors’ crossover interference profiles across multiple
genetic distance windows (as shown in e) do not differ (n=20 sperm donors;
Kruskal-Wallis chi-squared = 0.22; df=19; P=1, using 20 estimates (cM
distances) for each of 20 donors). Error bars show binomial 95% confidence
intervals on the proportion of cells with a second crossover in the window
given. This suggests that interindividual variation in crossover interference,
although substantial when measured in base pairs, is negligible when
measured in donor-specific genetic distance, pointing to a shared influence
upon crossover interference and crossover rate.
Extended Data Fig. 9 | Relationships of aneuploidy frequency to
chromosome size and recombination. a, The across-donor per-cell
frequency of chromosome losses (left) and gains (centre), plotted against the 
length of the chromosome (from reference genome hg38; for losses across 
n=22 chromosomes, Pearson’s r=—0.29, two-sided P= 0.19; and for gains 
across n=22 chromosomes, Pearson’s r=—0.23, two-sided P= 0.30). Right, the 
per-chromosome rate of losses exceeding gains (number of losses minus 
number of gains divided by number of cells) is plotted against the length of the 
chromosomes (across n= 22 chromosomes; Pearson’s r=—0.29, two-sided 
P=0.19). Red labels, acrocentric chromosomes. Error bars show 95% binomial 
confidence intervals on the per-cell frequency (number of events/number of 
cells, all 31,228 cells included). b-d, Relationship between aneuploidy 
frequency and recombination. Only autosomal whole-chromosome 
aneuploidies are included. b, Left, total number of crossovers on meiosis I 
nondisjoined chromosomes (blue line; chromosomes analysed, called as 
transitions between the presence of one haplotype and both haplotypes 
on the gained chromosome) compared with n=10,000 donor- and
chromosome-matched sets (35 x 2 chromosomes per set) of properly
segregated chromosomes (grey histogram; permutation). Fifty-four total 
crossovers on meiosis I gains versus 84.2 mean total crossovers on sets of 
matched chromosomes; one-sided permutation P< 0.0001, for the hypothesis 
that gained chromosomes have fewer crossovers. Right, as left but for gains
occurring during meiosis II (71 meiosis-II-derived gained chromosomes of one
whole copy from all individuals with fewer than five crossovers called on the
gained chromosome). One-sided permutation P= 0.98 for meiosis II from 
n=10,000 permutations, for the hypothesis that gained chromosomes have 
fewer crossovers; sister chromatids nondisjoined in meiosis II capture all 
crossovers whereas matched chromosomes do not: matched simulations and 
homologues nondisjoined in meiosis I capture only a random half of crossovers
occurring on that chromosome in the parent spermatocyte. c, Crossovers per 
nonaneuploid megabase from each cell from each donor, split by aneuploidy 
status (n cells =498, 50, 92, 30,609, left to right; ‘euploid’ excludes cells with 
any autosomal whole- or partial-chromosomal loss or gain; ‘gains’ includes 
gains of one or more than one chromosome copy; Mann-Whitney test 
W=7,264,117, 722,191, 1,370,376; two-sided P=0.07, 0.49, 0.66 for all autosomal 
aneuploidies, meiosis I gains and meiosis II gains, respectively, all compared 
against euploid). Each cell is represented by one point; boxplots show medians 
and interquartile ranges with whiskers extending to 1.5 times the interquartile
range from the box. d, Per-cell crossover rates versus per-cell rates of 
aneuploidy (left, loss and gain; middle and right, gain only, as only chromosome 
gain meiotic division can be determined); n=20 donors (coloured by crossover 
rate). P-values shown are for two-sided Pearson’s correlation tests. Error bars 
represent 95% confidence intervals on the mean crossover rate (x-axis) and on
the observed aneuploidy frequency (y-axis). 
Extended Data Fig. 10 | Additional examples of noncanonical aneuploidy
events detected with Sperm-seq. This figure includes those shown in Fig. 3f.
Copy number, SNPs, haplotypes and centromeres are plotted as in Fig. 3a.
Donor and cell identities are noted above each panel. Coordinates are in the
reference genome hg38. a, b, Chromosomes 2, 20, 21 (a) and 15 (b) are
sometimes present in three copies in an otherwise haploid sperm cell.
c, A distinct, recurring triplication of much of chromosome 15, from around
33 Mb onwards but not including the proximal part of the q arm, also recurs in
cells from three donors. d, Chromosome-arm-level losses (top three panels)
and gains (including in more than one copy, bottom three panels, and a
compound gain of the p arm and loss of the q arm, top panel).
Extended Data Fig. 11 | Single-cell and person-to-person variation in
diverse meiotic phenotypes may be governed by variation in the physical 
compaction of chromosomes during meiosis. Previous work showed that the 
physical length of the same chromosome varies among spermatocytes at the 
pachytene stage of meiosis, probably by differential looping of DNA along the 
meiotic chromosome axis (for example, the left column shows smaller loops, 
resulting in more loopsin total and ina greater total axis length compared with 
the right column, with larger loops)’*’?. This physical chromosome length is 
correlated across chromosomes among cells from the same individual", and 
correlates with crossover number ?0!42.73.76, This length—measured as the 
length of the chromosome axis or of the synaptonemal complex (the connector 
of homologous chromosomes)—can vary by two or more fold among a human’s 
spermatocytes”. We propose that the same process differs on average across 
individuals and may substantially explain interindividual variation in
recombination rate. On average, individual 1 (left) would have meiotic 
chromosomes that are physically longer (less compacted) in an average cell 
than individual 2 (right); one example chromosome is shown in the figure. After 
the first crossover on a chromosome (probably in a distal region of a
chromosome, where synapsis typically begins in male human meiosis before
spreading across the whole chromosome"), crossover interference prevents
nearby double-strand breaks (DSBs) from becoming crossovers; however, DSBs
that are far away can become crossovers (which themselves also cause 
interference). More DSBs are probably created on physically longer 
chromosomes, and crossover interference occurs among noncrossover as well 
as crossover DSBs”. Crossover interference occurs over relatively fixed 
physical (micrometre) distances** *°”*; these distances encompass different 
genomic (Mb) lengths of DNA in different cells or on average in different people 
owing to variable compaction. Thus, crossover interference tends to lead to a
different total number of crossovers as a function of the degree of compaction, 
resulting in the observed negative correlation (Fig. 2c, e) of crossover rate with 
crossover spacing (as measured in base pairs). Given that the first crossover 
probably occurs in a distal region of the chromosome, this model can also
explain the negative correlation (Fig. 2b, d) between crossover rate and the 
proportion of crossovers at chromosome ends. This figure shows the total 
number of crossovers, crossover interference extent, and crossover locations 
for both sister chromatids of each homologue combined; in reality, these 
crossovers are distributed among the sister chromatids, making these 
relationships harder to detect in daughter sperm cells and requiring large 
numbers of observations to make relationships among these phenotypes clear. 


Extended Data Table 1 | Sperm donor and single-sperm sequencing characteristics and results

Columns, left to right: donor; ancestry*; cells (number excluding cell and bead doublets); reads per cell (median, thousands); genome covered per cell (median, percent); heterozygous SNPs in genome (millions); unique heterozygous SNP alleles observed per cell (median, thousands); crossovers observed (total, thousands); crossovers per cell (mean); resolution of crossovers (kb, median); autosomal aneuploidy events (percent of cells)†; sex chromosome aneuploidy events (percent of cells)†.

Donor    Ancestry*      Cells   Reads  Genome  SNPs  Alleles  Crossovers  Per cell  Resolution  Autosomal  Sex
Overall  --             31,228  2118   1.0!    --    24.68    813+        26.11!    240!        7.68       0.95
NC1      Eur.           982     284    1.4     1.95  31.6     26          26.31     189         1.5        0.6
NC2      Eur.           1,680   163    0.8     1.98  18.2     37          22.19     307         2.0        0.7
NC3      Eur.           1,289   190    0.9     1.94  21.5     36          28.13     260         Tf         0.7
NC4      Eur.           1,482   243    1.1     1.98  26.8     40          26.98     243         1.1        0.5
NC6      Afr. Am.       1,370   154    0.8     2.53  23.8     38          27.57     253         0.7        0.3
NC8      As.            1,663   304    1.5     1.81  30.9     45          26.98     229         3.1        0.5
NC9      As.            1,894   245    1.2     1.79  25.6     53          27.98     231         0.8        1.8
NC10     As.            1,154   224    1.1     1.82  23.3     29          24.99     257         1.5        0.3
NC11     Eur.           1,930   202    1.0     1.92  22.8     50          25.82     242         1.3        0.4
NC12     Eur.           2,145   179    0.9     1.91  20.6     51          23.76     270         1.2        1.7
NC13     Eur.           1,514   259    1.2     1.92  28.3     4           27.19     202         0.9        1.0
NC14     Eur.           1,336   296    1.4     1.92  32.4     36          26.65     175         2.5        1.2
NC15     Eur.           1,702   211    1.0     1.93  23.2     42          24.80     268         1.0        0.9
NC16     Eur.           1,785   241    1.2     1.92  26.9     42          23.78     227         2.2        1.3
NC17     Eur.           1,504   220    1.0     1.94  23.8     39          25.92     250         2.1        0.7
NC18     Eur.           1,589   170    0.8     1.93  18.4     42          26.48     317         1.2        0.6
NC22     Afr. Am.       1,693   195    0.9     2.53  29.7     44          25.96     205         1.4        0.7
NC25     Afr. Am.       2,274   175    0.8     2.47  25.8     62          27.31     211         2.8        1.8
NC26     Afr. Am., As.  974     120    0.6     2.55  18.0     26          26.67     355         1.3        0.4
NC27     As. (?)        1,268   267    1.3     1.96  29.2     34          26.80     199         1.7        0.6


*Ancestry as provided by the sperm bank. Afr. Am., of African American ancestry; Eur., of European ancestry; As., of Asian ancestry; (?), conflicting ancestry information given.
†These numbers are the total number of aneuploidy events divided by the total number of cells multiplied by 100; cells can have more than one event.
‡Sum across all cells from all sperm donors.
§Median or mean across all individual cells from all sperm donors (31,228 measurements summarized).
‖Median or mean of aggregate metrics across samples (20 measurements summarized).
¶Median across all crossovers (813,122 measurements summarized).
Statistics


For all statistical analyses, confirm that the following items are present in the figure legend, table legend, main text, or Methods section. 


n/a | Confirmed

x  The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement

x  A statement on whether measurements were taken from distinct samples or whether the same sample was measured repeatedly

   The statistical test(s) used AND whether they are one- or two-sided
   Only common tests should be described solely by name; describe more complex techniques in the Methods section.

x  A description of all covariates tested

x  A description of any assumptions or corrections, such as tests of normality and adjustment for multiple comparisons

x  A full description of the statistical parameters including central tendency (e.g. means) or other basic estimates (e.g. regression coefficient)
   AND variation (e.g. standard deviation) or associated estimates of uncertainty (e.g. confidence intervals)

x  For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted
   Give P values as exact values whenever suitable.

x  For Bayesian analysis, information on the choice of priors and Markov chain Monte Carlo settings

x  For hierarchical and complex designs, identification of the appropriate level for tests and full reporting of outcomes

x  Estimates of effect sizes (e.g. Cohen's d, Pearson's r), indicating how they were calculated

Our web collection on statistics for biologists contains articles on many of the points above. 


Software and code 


Policy information about availability of computer code 


Data collection All data were collected via an Illumina NovaSeq, which generated sequencing BCL files. All subsequent data processing is outlined below.


Data analysis Picard Tools v2.2 was used for sequence data processing (http://broadinstitute.github.io/picard/). BWA-MEM v0.7.7-r441 was used for 
alignment (http://bio-bwa.sourceforge.net/). GATK v3.7 was used to call variants (https://gatk.broadinstitute.org/). Custom code in Drop- 
seq Tools v2.2 was used to format sequencing data to include single-cell barcode information and to call heterozygous data in sperm 
genomes (https://github.com/broadinstitute/Drop-seq/releases). Genome STRiP v2.0 was used to ascertain read depth for sperm cells 
(http://software.broadinstitute.org/software/genomestrip/). HapCUT v1 was used for phasing (https://github.com/vibansal/hapcut). 
Custom code was written in R (www.r-project.org) for this study for data processing and to call and analyze recombination and 
aneuploidy events; it is available via Zenodo at http://dx.doi.org/10.5281/zenodo.2581595 


For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors/reviewers. 
We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Research guidelines for submitting code & software for further information. 


Data 


Policy information about availability of data 


All manuscripts must include a data availability statement. This statement should provide the following information, where applicable: 
- Accession codes, unique identifiers, or web links for publicly available datasets 
- A list of figures that have associated raw data 
- A description of any restrictions on data availability


Crossover and aneuploidy data (individual events and counts per donor and/or cell), including source data underlying Figs. 2, 3b-e and Extended Data Figs. 5-9, are 
available via Zenodo, http://dx.doi.org/10.5281/zenodo.2581570. Raw sequence data are available in the SRA via dbGaP for general research use upon application 
and approval (study accession number phs001887.v1.p1). 
Field-specific reporting


Please select the one below that is the best fit for your research. If you are not sure, read the appropriate sections before making your selection. 


[x] Life sciences   [ ] Behavioural & social sciences   [ ] Ecological, evolutionary & environmental sciences


For a reference copy of the document with all sections, see nature.com/documents/nr-reporting-summary-flat.pdf 


Life sciences study design 


All studies must disclose on these points even when the disclosure is negative. 


Sample size We aimed to sequence ~1,000 sperm cells per individual donor to be able to detect rare events (aneuploidy occurs at a frequency of less than 
a few percent), and the actual number of cells sequenced per donor was a result of experimental fluctuations during library preparation. With 
this amount of sequencing per donor as a baseline, we then chose to sequence cells from 20 individuals to be able to observe inter-individual 
differences and to be able to detect trends with moderate power (for example, an analysis of 20 individuals is ≥80% powered to detect a
Pearson's r of 0.58 at p = 0.05). 
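
The power statement above can be checked approximately. The following is a minimal sketch that assumes a two-sided test and uses the Fisher z-transformation; it is not necessarily the calculation the authors ran, and the function name `pearson_power` is introduced here for illustration only. Under these assumptions the approximation gives a power of roughly 0.78 for n = 20 and r = 0.58, close to the figure quoted.

```python
import math
from statistics import NormalDist


def pearson_power(r: float, n: int, alpha: float = 0.05) -> float:
    """Approximate two-sided power to detect a true Pearson correlation r
    with n samples, using the Fisher z-transformation (an approximation)."""
    std_normal = NormalDist()
    # Under the alternative, atanh(sample r) is roughly normal with
    # mean atanh(r) and standard error 1 / sqrt(n - 3).
    z_effect = math.atanh(r) * math.sqrt(n - 3)
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Probability of landing in either rejection region.
    return std_normal.cdf(z_effect - z_crit) + std_normal.cdf(-z_effect - z_crit)


power = pearson_power(r=0.58, n=20, alpha=0.05)
print(f"approximate power: {power:.2f}")
```

Exact methods based on the sampling distribution of r can give slightly different values, so the output should be read as a sanity check rather than a reproduction.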


Data exclusions No data were excluded from the analyses. For certain analyses, subsets of the data were used to examine questions about those subsets. 


Replication As this was a completely new data set, we did not seek to replicate our findings with a data set of the same type. We did compare our findings 
of crossover location and frequency to published estimates generated with other methods; our observations were consistent with earlier 
estimates (detailed in manuscript). 


Randomization N/A - we did not have experimental groups.


Blinding N/A - we did not have experimental groups. 


Reporting for specific materials, systems and methods 


We require information from authors about some types of materials, experimental systems and methods used in many studies. Here, indicate whether each material, 
system or method listed is relevant to your study. If you are not sure if a list item applies to your research, read the appropriate section before selecting a response. 


Materials & experimental systems
n/a | Involved in the study
[x] | [ ]  Antibodies
[x] | [ ]  Eukaryotic cell lines
[x] | [ ]  Palaeontology
[x] | [ ]  Animals and other organisms
[ ] | [x]  Human research participants
[x] | [ ]  Clinical data

Methods
n/a | Involved in the study
[x] | [ ]  ChIP-seq
[x] | [ ]  Flow cytometry
[x] | [ ]  MRI-based neuroimaging
Human research participants


Policy information about studies involving human research participants 


Population characteristics The 20 sperm sample biospecimens analyzed in this study came from 20 anonymous, karyotypically normal sperm donors
at the New England Cryogenic Center (NECC). Per sperm bank policy, donors are 18-38 years old at the time of donation; precise 
age of donors is not released to researchers. The donors are known only to NECC and not to the researchers. Donor identifiers 
used in the paper were created specifically for this study and are not linked to any external identifiers. 


Recruitment The biospecimens used were discarded samples from New England Cryogenic Center; the samples had been collected during 
sperm donation. Donors consented, at the time of donation, for biospecimens to also be used for research purposes. NECC 
routinely provides such samples to researchers for research that has been IRB-approved at the researchers' home institution. 
The sperm donors might be subject to self-selection biases such as need for extra income and free time availability for donation,
though any specific biases are unknown. If recombination or aneuploidy were correlated with any sperm donor recruitment or 
self-selection biases, the results of this study would reflect these underlying correlates. For example, the results are only 
reflective of donors aged 18-38 who had enough free time to act as sperm donors; if individuals of older ages or less free time 
had different meiotic phenotypes, they would not be captured in this study. 


Ethics oversight The Harvard Faculty of Medicine Office of Human Research Administration reviewed the research protocols (protocols 
M23743-101 and IRB16-0834) and determined that this research was "Not human subject research", based on the use of 
discarded biospecimens and the fact that researchers did not have interactions with the biospecimen donors. Harvard has also 
reviewed and approved our deposition of the data into an NIH repository; dbGaP will assure that access is provided only to
researchers with legitimate research uses who agree not to try to re-identify donors based on data in the study. 


Note that full information on the approval of the study protocol must also be provided in the manuscript. 
Cancers arise through the acquisition of oncogenic mutations and grow by clonal
expansion’”. Here we reveal that most mutagenic DNA lesions are not resolved into a 
mutated DNA base pair within a single cell cycle. Instead, DNA lesions segregate, 
unrepaired, into daughter cells for multiple cell generations, resulting in the 
chromosome-scale phasing of subsequent mutations. We characterize this process in
mutagen-induced mouse liver tumours and show that DNA replication across
persisting lesions can produce multiple alternative alleles in successive cell divisions, 
thereby generating both multiallelic and combinatorial genetic diversity. The phasing 
of lesions enables accurate measurement of strand-biased repair processes, 
quantification of oncogenic selection and fine mapping of sister-chromatid-exchange
events. Finally, we demonstrate that lesion segregation is a unifying property of
exogenous mutagens, including UV light and chemotherapy agents in human cells
and tumours, which has profound implications for the evolution and adaptation of
cancer genomes.


Analysis of cancer genomes has led to the identification of many 
driver mutations and mutation signatures’ that illustrate how envi- 
ronmental mutagens cause genetic damage and increase cancer risk*>. 
The numerous patterns of mutations identified in cancer genomes
reflect the temporal and spatial heterogeneity of exogenous and
endogenous exposures, mutational processes and germline varia- 
tion among patients. A study of diverse human cancers identified 49 
distinct single-base-substitution signatures, with almost all tumours 
showing evidence of at least three such signatures?. 

This intrinsic heterogeneity leads to overlapping mutation signa- 
tures that make it difficult to accurately disentangle the biases of DNA 
damage and repair, or to interpret the dynamics of clonal evolution. 
We reasoned that a more controlled and genetically uniform cancer 
model system would overcome some of these limitations. By effectively 
re-running cancer evolution hundreds of times, we aimed to explore 
oncogenesis and mutation patterns at high resolution and with good 
statistical power. 

We chemically induced liver tumours in postnatal day 15 (P15) male 
C3H/HeOuJ inbred mice (hereafter referred to as C3H mice) (Fig. 1a; 
n=104) using a single dose of diethylnitrosamine (DEN)°. For com-
parison and validation, we replicated the study in the divergent mouse
strain CAST/EiJ’ (hereafter referred to as CAST mice) (Extended Data
Fig. 1; n=54).

Whole-genome sequencing (WGS) of 371 independently evolved
tumours from 104 C3H mice (Supplementary Table 1) revealed that each 
genome had about 60,000 (approximately 13 per Mb) somatic point 
mutations (Extended Data Fig. 1a), a similar level to that found in human 
cancers caused by exogenous mutagens such as tobacco’ and UV expo- 
sure’. Insertion-deletion mutations and larger segmental changes were 
rare (Extended Data Fig. 1a-f). Point mutations were predominantly
(76%) T>N or their complement A>N changes (where N represents any 
other nucleotide; Fig. 1b, Extended Data Fig. 1g-j), consistent with the 
long-lived thymine adduct O⁴-ethyl-deoxythymidine being the principal
mutagenic lesion”. Known driver mutations were in the EGFR-RAS-RAF 
pathway"? (Fig. 1c) and usually mutually exclusive. Similar results 
were replicated in CAST mice (Extended Data Fig. 1)). 


Chromosome-scale segregation of lesions

In each tumour, we observed multimegabase genomic segments with 
pronounced Watson-versus-Crick-strand asymmetry of mutations, 
frequently encompassing entire chromosomes (Fig. 2). We define
Watson-strand bias as an excess of T>N over A>N mutations when called
on the forward strand of the reference genome, and Crick-strand bias
as the converse of this. With a median span of 55 Mb (Fig. 2a-d), these
asymmetrically mutated segments are orders of magnitude longer than
asymmetries generated by transcription-coupled nucleotide-excision
repair (TCR)", APOBEC mutagenesis’ or replication biases”"*. Total
mutation load and DNA copy number remain uniform across the
genome (Fig. 2e, f).

Fig. 1 | DEN-initiated tumours have a high burden of T>N and A>N mutations
and driver changes in the EGFR-RAS-RAF pathway. a, P15 male C3H mice
received a single dose of DEN; 371 tumours were isolated 25 weeks later (P190)
and analysed by histopathology and WGS. b, Aggregated mutations showing the
distribution of nucleotide substitutions; every fourth trinucleotide context is
displayed (x-axis). c, Each tumour is shown as a column with its mutation rate (4)
per million base pairs (Mb) (black) and driver mutations (brown boxes).

Pervasive, strand-asymmetric mutagenesis can be explained by 
DEN-induced lesions remaining unrepaired before genome replica- 
tion. The first round of replication after DEN treatment results in two 
sister chromatids with independent lesions on each parent strand, 
and daughter strands containing misincorporation errors comple- 
mentary to the lesions (Fig. 2i). The sister chromatids segregate into 
separate daughter cells during mitosis, and lesion-mutation duplexes 
are resolved into a mutated DNA base pair by later replication cycles. 
Asymmetric regions show a 23-fold excess (median) of their preferred 
mutation over its reverse complement, thus more than 95% of lesions 
that generate a mutation segregate for at least one mitotic division. 
We subsequently refer to this phenomenon as ‘lesion segregation’. 
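
The step from a 23-fold excess to "more than 95%" can be made explicit. The sketch below is one simple reading of that arithmetic, not the authors' code: it assumes that, within an asymmetric region, mutations in the preferred orientation come from lesions that segregated and that mutations in the reverse-complement orientation set the baseline. The function name `segregating_fraction` is hypothetical.

```python
def segregating_fraction(fold_excess: float) -> float:
    """Share of mutations in the preferred orientation, given the ratio of
    preferred mutations to their reverse complement (for example, T>N : A>N)."""
    return fold_excess / (fold_excess + 1.0)


# Median excess reported in the text for asymmetric regions.
fraction = segregating_fraction(23.0)
print(f"{fraction:.1%}")
```

With a ratio of 23:1, 23 of every 24 mutations are in the preferred orientation, which is about 95.8%.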

The haploid X chromosome always contains segments with a strong 
strand bias (Fig. 2g). On autosomal chromosomes, when both allelic 
copies have the same bias, the genome shows that bias (for example, 
Watson bias on chromosome 15 in Fig. 2a-d); when one copy has Watson
bias and the other has Crick bias, the chromosome appears unbiased
(for example, chromosome 19 in Fig. 2a-d). A model based on random
retention of Watson- or Crick-biased chromosomes accurately predicts 
that (1) around 50% of the autosomal genome and (2) 100% of the hap- 
loid X chromosome show mutational asymmetry (Fig. 2g, Extended 
Data Fig. 2). A few tumours (3.5%) have absent or muted asymmetry; 
cellularity estimates indicate that they are polyclonal or polyploid 
(Supplementary Table 1). 
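
The two predictions of the random-retention model can be illustrated with a small simulation. This is a deliberately simplified sketch rather than the authors' model: it treats each chromosome copy as a single unit that independently retains either the Watson or the Crick lesion strand with equal probability, ignores sister-chromatid exchange, and assumes 19 autosome pairs plus one X chromosome per tumour (male mouse). The function and parameter names are illustrative.

```python
import random


def simulate_asymmetry(n_tumours: int = 10_000, n_autosomes: int = 19, seed: int = 0):
    """Return (fraction of autosomes showing a net strand bias,
    fraction of X chromosomes showing a strand bias)."""
    rng = random.Random(seed)
    biased_autosomes = 0
    biased_x = 0
    for _ in range(n_tumours):
        for _ in range(n_autosomes):
            # Each of the two allelic copies retains Watson (W) or Crick (C) lesions.
            copy_a = rng.choice("WC")
            copy_b = rng.choice("WC")
            # A net bias is visible only when both copies agree.
            biased_autosomes += copy_a == copy_b
        # The single X copy is always either Watson- or Crick-biased.
        x_copy = rng.choice("WC")
        biased_x += x_copy in "WC"
    return biased_autosomes / (n_tumours * n_autosomes), biased_x / n_tumours


autosomal_fraction, x_fraction = simulate_asymmetry()
print(f"autosomes with net bias: {autosomal_fraction:.3f}")
print(f"X chromosomes with bias: {x_fraction:.3f}")
```

Under these assumptions the autosomal fraction converges on one half, because two independent copies agree half of the time, whereas a haploid chromosome has no second copy to cancel its bias.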


Resolving sister-chromatid exchange 


The lesion segregation model predicts that mutational asym- 
metries should span whole chromosomes. However, we observe 
symmetry switches between multimegabase segments of Wat- 
son and Crick bias within chromosomes (Fig. 2a-d, g). These 
probably represent sister-chromatid exchanges (SCEs) from 
homologous-recombination-mediated DNA repair” (Extended Data 
Fig. 4a). SCEs are typically invisible to sequencing technologies because 
homologous recombination between sister chromatids is thought to
be error-free’. 

SCE frequency per tumour positively correlates with point mutation 
rate (Extended Data Fig. 3a, b). With about 27 SCEs (median) in each 
tumour genome (n=371), we had sufficient statistical power to detect 
recurrent exchange sites and biases in genomic context (Extended Data 
Fig. 3c, d). After removing three reference-genome misassemblies 
(Fig. 2g, Extended Data Fig. 3e, f), we found that SCEs occur with mod- 
est enrichment in transcriptionally inactive, late-replicating regions 
(Extended Data Fig. 4b). The fine mapping (approximately 20-kb reso- 
lution) of SCEs enabled us to test the fidelity of homologous recom- 
bination. The mutation rate appears locally elevated at SCEs, but the 
mutational spectrum matches the rest of the genome (Extended Data 
Fig. 4c-f). A model of Holliday-intermediate branch migration could 
explain these observations (Extended Data Fig. 4g). 


Lesion segregation reveals selection 


Cumulatively the tumours have equal Watson and Crick lesion-strand 
retention across most of the genome (Fig. 2h). However, we observe 
striking deviations at loci spanning known driver genes (Fig. 2h). The 
T→A mutation at codon 584 of the Braf driver gene’ is observed in 153
out of 371 tumours in C3H mice, and we would expect the surround- 
ing chromosomal segment to retain T lesions on the same strand. 
This is the case in 94% (144 out of 153) of tumours (Fisher’s exact test, 
P=3.6 x10"). By contrast, tumours lacking the Braf mutation do not 
show a retention bias (47% Crick, 53% Watson; P= 0.88, not rejecting the
50:50 null expectation). We applied this test for oncogenic selection 
at sites with sufficient recurrent mutations to have statistical power, 
which confirmed that there was significant oncogenic selection in Hras,
Braf and Egfr (Fig. 1c, Extended Data Table 1).
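
The logic of the retention test can be sketched with the counts given above (144 of 153 Braf-mutant tumours retaining lesions on the expected strand). The text reports Fisher's exact test; the example below swaps in a simpler exact two-sided binomial test against the 50:50 null, so its P value is not expected to match the published one. The helper name `binomial_two_sided_p` is hypothetical.

```python
import math


def binomial_two_sided_p(successes: int, trials: int) -> float:
    """Exact two-sided binomial P value against a null probability of 0.5.
    For p = 0.5 the distribution is symmetric, so the two-sided value is
    twice the tail at least as extreme as the observation, capped at 1."""
    extreme = max(successes, trials - successes)
    tail = sum(math.comb(trials, k) for k in range(extreme, trials + 1))
    return min(1.0, 2 * tail / 2 ** trials)


p_braf = binomial_two_sided_p(144, 153)
print(f"P = {p_braf:.2e}")
```

A split this lopsided is extremely unlikely under random strand retention, which is the signature of selection acting on the driver mutation.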


DNA repair with lesion-strand resolution 


Resolving DNA lesions to specific strands within a single cell cycle pre- 
sents a unique opportunity to investigate strand-specific DNA damage 
and repair in vivo. For example, TCR (Fig. 3a) specifically removes DNA 
lesions from the RNA template strand”””°. 

We generated transcriptomes from the tissue of origin at the devel- 
opmental time of DEN mutagenesis. Mutation rates were calculated 
for each gene in each tumour, stratified by both expression level and 
the strand containing lesions (Fig. 3b). As expected, TCR was highly 
specific to the template strand and correlated closely with gene expres- 
sion. The mutation rate in non-expressed genes had no observable 
transcription-strand bias. By contrast, mutations in highly expressed 
genes were reduced by 79.8 ± 1.0% (mean ± s.d.) if the tumour had
template-strand lesions. 

To evaluate the specificity of TCR, we compared mutation rates 
for each trinucleotide context between template and non-template 
strands, stratified by expression level (Fig. 3c). For highly expressed 
genes, thymines have an 82 ± 6.8% (mean ± s.d. across sequence contexts)
lower mutation rate on the template strand; the non-template
mutation rate is indifferent to expression (Fig. 3c, dark blue lines are
close to vertical), as expected. Mutations from C and G show highly efficient
TCR on the template strand: 70 ± 7.8% and 34 ± 21%, respectively.
In contrast to T mutations, they also show an expression-dependent 
reduction in mutation rate on the non-template strand, suggesting 
that non-TCR repair processes are active. Rare mutations from adenine 
on the lesion-containing strand increase with transcription, possibly 
owing to activity of error-prone trans-lesion DNA polymerase Pol-η.

The ability to resolve the lesion strand unmasks the contribution of 
bidirectional transcription from active promoters in shaping mutation
patterns (Fig. 3d-f, Extended Data Fig. 5). Genic transcription is
associated with a sharp, sustained reduction in mutation rate from 
template-strand lesions. A local increase in the mutation rate over 

Fig. 2 | Chromosome-scale and strand asymmetric segregation of DNA
lesions. a-f, An example DEN-induced C3H tumour (identifier: 94315_N8). 
a-c, Mutational asymmetry. Individual T>N mutations shown as blue (a;
Watson strand) and gold (c; Crick strand) points; the y-axis shows distance to 
the nearest same-strand T>N mutation; nt, nucleotide. b, Segmentation of 
mutation strand asymmetry patterns. The y-axis shows degree of asymmetry 
(grey indicates no bias). Red indicates symmetry switches. d, Asymmetric 
segments shown as ribbon plot. e, Mutation rate in 10-Mb windows; blue line
shows genome-wide average. f, DNA copy number in 10-Mb windows (grey) and
for each asymmetric segment (black). g, Ribbon plots (as in d) for 371 C3H
tumours ranked by X chromosome asymmetry. Purple triangle indicates the
example tumour depicted in a-f. Grey diamonds mark reference genome
misassemblies. h, Driver genes distort the balance of Watson and Crick 
asymmetries (see Methods). i, Mechanistic model of lesion segregation. (1) A 
mutagen generates lesions (red triangles) on both DNA strands. (2) If not 
removed, lesions will segregate into sister chromatids: one carrying only 
Watson-strand lesions (blue) and the second carrying only Crick-strand lesions 
(gold). Postmitotic daughter cells will have independent lesions and resulting 
replication errors (3), resolved into full mutations in later replication (4). 

(5) Only lineages containing driver changes (* in (1)) will expand into substantial 
populations. 


approximately 200 nucleotides upstream of the transcription start 
site (Fig. 3d) is revealed to result from genic and upstream bidirectional 
transcription emerging from opposite edges of the core promoter, leading
to a local depletion of TCR activity within the promoter (Fig. 3e, f).

Fig. 3 | Identification of the lesion-containing DNA strand enables TCR to
be quantified with strand specificity. a, TCR of DNA lesions is expected to 
reduce the mutation rate only when lesions are on the template strand of an 
expressed gene. Black spot shows RNA polymerase and black line the nascent 
RNA transcript. b, TCR of template-strand lesions is dependent on
transcription level (P15 liver, median transcripts per million (TPM)). Estimates 
of mutation rate (circles) are the aggregate rates for expression level of binned 
genes across C3H tumours (n =371). Expression level bin 0 contains n=2,835 
genes, all subsequent bins contain n = 4,351 ± 1 genes (inclusion criteria,
see Methods); empirical confidence intervals (99%) were calculated
through bootstrap sampling (n = 100 replicates) of genes within each bin.

c, Comparison of template versus non-template mutation rates for the 64 
trinucleotide contexts: each context has a high and a low expression point
linked by a line. d, Sequence-composition-normalized profiles of mutation rate
around transcription start sites (TSS). e, Stratifying by lesion strand reveals 
how bidirectional transcription initiation shapes the observed mutation 
patterns. f, Higher resolution of the TSS region from e.


An engine for genetic diversity

A segregating lesion may act as template for multiple rounds of replication
in successive cell cycles (Fig. 2i). Each replication could incorporate
different incorrectly or correctly paired nucleotides opposite
a persistent lesion, resulting in multiple alleles at the same position. 
Consistent with this notion, multiallelic mutations have been reported 
in human cancers and a cell-lineage-tracking system.

We evaluated multiallelic variation by identifying sites with mul- 
tiple high-confidence—but conflicting—mutation calls. On average, 
8% of mutated sites in DEN-induced tumours have multiallelic vari- 
ants (n=1.8 x 10° sites in C3H tumours); per tumour, this value ranges 
from less than 1% to 26% (Fig. 4a). As a control, only 0.098% (95% con- 
fidence interval: 0.043-0.25%) of sites permuted between tumours 
show evidence of non-reference nucleotides. We further validated 
WGS multiallelic-variant calls using independently performed exome 
sequencing (Fig. 4b).

Fig. 4 | Lesion segregation generates multiallelic and combinatorial
genetic diversity. a, Mutation sites per tumour with robust support for 
multiallelic variation; the grey line shows the median; null expectation is from 
permutation between tumours. b, Validation rate for mutations from WGS in 
independent whole-exome sequencing (WES); n=15 tumours, collectively with 
n=20,683 WGS mutations meeting inclusion criteria (Methods). Curves show 
validation rate stratified by WGS read support. Empirical 95% confidence 
interval (CI) from 100 bootstrap samplings of the aggregated WGS mutations.
The null expectation permuted tumour identity between WGS and WES. 

c, Sequence reads spanning proximal mutated sites. d, As c, but showing
combinatorial diversity between a pair of biallelic sites. e, Correlation between
per-tumour multiallelic rate and combinatorially diverse mutation pairs
(as in c, d), with one point per tumour. f, Tree of all possible progeny of a
DEN-mutagenized cell for ten generations. Blue and gold lines trace simulated 


The generation of multiallelic variation produces combinatorial 
genetic diversity that would not be expected under purely clonal expan- 
sion. This can be directly visualized in pairs of mutations spanned by 
individual sequencing reads (Fig. 4c, d). The observed combinations of 
biallelic sites require replication over lesions without the generation of 
mutations in some cell divisions (Fig. 4d). This directly demonstrates
that non-mutagenic synthesis over DNA lesions occurs, and allele fre- 
quency analysis indicates it is common (Extended Data Fig. 6). The 
per-tumour rates of combinatorial diversity and multiallelic sites corre- 
late closely and highlight the wide variation between tumours (Fig. 4e). 

The explanation for such intertumour variance becomes evident 
when plotting the distribution of multiallelic sites along each genome 
(Fig. 4f-i). Tumours with high rates of genetic diversity have consist- 
ently high rates of multiallelism throughout their genome (Fig. 4g). 
They are likely to have expanded from a first-generation daughter 
of the original DEN-mutagenized cell, in which all DNA is a duplex 
of a lesion-containing and non-lesion-containing strand. Therefore,
replication using lesion-containing strands as the template in subse- 
quent generations produces multiallelic variation uniformly across the 
genome. Tumours with lower total levels of genetic diversity exhibit 
discrete genomic segments of high and low multiallelism (Fig. 4h, i). 
These tumours probably developed from a cell some generations after 

segregation of lesion-containing strands from a single haploid chromosome.
Coloured nodes show hypothetical transformed daughter lineages with their 
multiallelic patterns (right). g-i, Mutation asymmetry summary ribbons for 
example C3H tumours that show high (g), variable (h) or low (i) rates of genetic 
diversity. The percentage of mutation sites with robust support for multiallelic 
variation calculated in 10-Mb windows (grey) and for each asymmetric segment 
(black). j, Histogram of the estimated cell generation post-DEN exposure from
which tumours developed based on the proportion of multiallelic segments.

k, Enrichment of specific driver gene mutations in earlier (generation 1) and 
later (generation >1) transforming tumours. log2 odds ratios (circles) from
Fisher’s exact test with 95% confidence intervals (whiskers) calculated from the 
hypergeometric distribution. All n=371 tumours were included in the analysis 
for each gene.


DEN treatment. Each mitosis following DEN exposure is expected to 
dilute the number of lesion-containing strands in each daughter cell 
by approximately 50%. Only lesion-retaining fractions of the genome 
generate multiallelic and combinatorial genetic diversity in the daugh- 
ter lineages; consistent with this, the multiallelic segments mirror the 
mutational asymmetry segmentation pattern. 

By estimating the fraction of multiallelic chromosomal segments, 
we can infer the cell generation, relative to DEN exposure, from which 
the tumour expanded (Fig. 4j). In 67% of C3H tumours and 21% of CAST 
tumours, the initial burst of mutations was instantly transformative. In 
the remainder of tumours, the observed fractions of multiallelic seg- 
ments cluster around expectations for subsequent cell generations, 
suggesting that transformation required a specific combination of 
mutated alleles, an additional mutation or an external trigger. Of note, 
Egfr-driven tumours appear to transform significantly later (P=0.042 
after Bonferroni correction, Fisher’s exact test), suggesting that driver 
gene identity influences the timing of tumour inception (Fig. 4k). 
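
The dilution argument above can be turned into a rough estimator. The sketch below assumes that a first-generation daughter retains lesion strands across its whole genome and that each later mitosis halves that fraction, so generation g is expected to show a multiallelic fraction of 0.5^(g - 1). This is an illustrative reading of the text, not the authors' procedure, and the function names and the generation cap are hypothetical.

```python
from math import log2


def expected_fraction(generation):
    """Expected lesion-retaining fraction of the genome at a given generation."""
    return 0.5 ** (generation - 1)


def infer_generation(multiallelic_fraction, max_generation=10):
    """Nearest generation (on a log2 scale) for an observed multiallelic fraction."""
    if not 0 < multiallelic_fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    generation = 1 + round(-log2(multiallelic_fraction))
    # Clamp to a plausible range; the upper bound is an arbitrary assumption.
    return min(max(generation, 1), max_generation)
```

Under these assumptions, a tumour with multiallelic segments across about half of its genome would be assigned to generation 2.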


Lesion segregation is ubiquitous 
Lesion segregation is a feature of DEN mutagenesis in mice. This raises 
two critical questions. Do other DNA-damaging agents induce lesion 

Fig. 5 | Lesion segregation is a pervasive feature of exogenous mutagens
and is evident in human cancers. a, The runs-based rl20 metric, calculated for
an example simulated solar radiation (SSR) clone (Extended Data Fig. 7a);
20% of informative mutations (C>T or their complement G>A) are in strand
asymmetric runs of at least 22 consecutive mutations (for example, ≥22 C>T
mutations without an intervening G>A). Simulated null based on 100,000
permutations of 1,000 mutations; black curve shows median. b, All robust
mutagens in human iPSCs, mutagen classes indicated by coloured boxes;
PAH indicates polycyclic aromatic hydrocarbons. Individual compound
abbreviations expanded in Supplementary Table 2. The rl20 metric (x-axis) is
plotted for each clone (n=325), including multiple replicates per exposure.
Data point size quantifies informative mutations; *P < 0.05 (two-sided,
Bonferroni-corrected). c, The rl20 metric and runs tests for human cancers;
n=18,850 cancers screened, three cohorts plotted. Blue lines show Bonferroni
adjusted P= 0.05 threshold for the runs test (two-sided) and an empirical
threshold for rl20 (Methods). x-axis P-values <1 x 10 are rank-ordered.
d, Mutational asymmetry (plotted as in Fig. 2a-c) in a human hepatocellular
carcinoma (donor DO231953) with a dominant mutation signature for 
aristolochic acid exposure. 


segregation? Does lesion segregation occur in human cells and cancers? 
Recently, a study in which human induced pluripotent stem cells (iPSCs) 
were exposed to 79 environmental mutagens revealed that 41 of the 
mutagens produced excess nucleotide substitutions. Although not
previously noted in these in vitro data, many of the exposures generated 
chromosome-scale lesion segregation patterns (Extended Data Fig. 7)
similar to those observed in the in vivo DEN model. Applying runs-based 
tests (Fig. 5a, b, Extended Data Fig. 8), we detect marked mutational 
asymmetry in every sample with more than 1,000 ‘informative’ muta- 
tions (Fig. 5b, Extended Data Fig. 8b; see Methods), including clinically 
relevant insults such as sunlight (simulated solar radiation), tobacco 
smoke (benzo[a]pyrene diol-epoxide (BPDE)) and chemotherapeutics 
(temozolomide). By contrast, mutations induced by perturbation of 
replication and repair pathways independent of DNA lesions showed
no detectable asymmetry, as expected (Extended Data Fig. 8c). We
conclude that the chromosome-scale segregation of lesions and the
resulting strand asymmetry of mutation patterns are general features
of all tested DNA-damaging mutagens. 
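
One way to read the runs-based metric is sketched below. It assumes, from the Fig. 5a caption only, that the metric is the longest run length L such that a given fraction (20% by default) of informative mutations lies in same-strand runs of at least L consecutive mutations. The exact definition is in the Methods and is not reproduced here, so the function names and this interpretation are assumptions.

```python
from itertools import groupby


def run_lengths(labels):
    """Lengths of consecutive runs of identical strand labels along a chromosome."""
    return [sum(1 for _ in group) for _, group in groupby(labels)]


def rl_metric(labels, fraction=0.2):
    """Largest L such that >= `fraction` of mutations sit in runs of length >= L."""
    runs = sorted(run_lengths(labels), reverse=True)
    total = sum(runs)
    if total == 0:
        raise ValueError("no informative mutations")
    covered = 0
    for length in runs:  # add runs from longest to shortest
        covered += length
        if covered >= fraction * total:
            return length


# Labels mark which strand each informative mutation is assigned to.
asymmetric = list("WWWWWWCCWC")   # runs of 6, 2, 1, 1
alternating = list("WCWCWCWCWC")  # every run has length 1
```

A strongly asymmetric genome gives a large value (6 for the first toy example), whereas strictly alternating mutations give 1.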

The pronounced mutational asymmetry observed in both 
DEN-induced tumours and mutagen-exposed human iPSCs occurs
after a single mutagenic insult. By contrast, most human cancers accumulate
mutations as a result of multiple damaging events over their
history. Lesion segregation predicts that such tumours will acquire 
new waves of segregating lesions after each exposure, thus progressively
masking their asymmetry patterns. Therefore, even though UV
exposure causes substantial lesion segregation in human cells (Fig. 5a,
b, Extended Data Figs. 7a, 8b), it is unlikely that skin cancers would show
mutational asymmetry following repeated UV exposure. 

Nevertheless, analysis of human cancer genomes (n = 18,850
tumours, 22 primary sites) identified multiple cancers with the char- 
acteristic mutational asymmetry of lesion segregation (Fig. 5c, d). The 
majority of these tumours are renal, hepatic or biliary in origin, and 
show a high mutation rate and strand asymmetry of T>A or their complement
A>T mutations, consistent with exposure to aristolochic acid
(Supplementary Table 2). Although it is seen most clearly in tumours 
subjected to a single dose of a mutagen, lesion segregation probably 
shapes all genomes subjected to DNA damage, with important implications
for tumour evolution and heterogeneity.


Discussion 


In this study, we have shown that most mutation-causing DNA lesions
are not resolved as mutations within a single cell cycle. Instead, lesions
segregate unrepaired into daughter cells for multiple cellular generations,
resulting in chromosome-scale strand asymmetry of subsequent
mutations. This suggests that lesion removal before replication has 
high fidelity and rarely results in mutations. Lesion segregation was 
initially discovered in an in vivo mouse model of oncogenesis; we 
have demonstrated that it is ubiquitous for all tested mutagens, also 
occurs in human cells and is evident in human cancers. Similar patterns 
of asymmetry in bacterial mutagenesis suggest that the underlying 
mechanisms are highly conserved.

Our discovery of lesion segregation challenges longstanding assumptions
of cancer evolution. For example, the widely used infinite sites
model does not allow for recurrent mutation at the same site. Our
findings also provide new perspectives for understanding cancer evolu- 
tion using mutational asymmetry and multiallelism patterns to track 
events during oncogenesis and to quantify selection. Perhaps most 
notably, lesion segregation is a previously unrecognized mechanism 
for a cancer to sample the fitness effects of mutation combinations, 
thus evading Muller’s ratchet and Hill-Robertson interference, which
assumes low selection efficiency owing to the inability to separate
mutations of opposing fitness. Consequently, DNA-damaging
chemotherapeutics, particularly large or closely spaced doses gener- 
ating persistent lesions, could inadvertently provide an opportunity 
for cancer to efficiently select resulting mutations. This insight may 
guide the development of more effective chemotherapeutic regimens. 

Once identified, lesion segregation is a deeply intuitive concept. Its 
practical applications provide new vistas for the exploration of genome 
maintenance and fundamental molecular biology. The discovery of 
pervasive lesion segregation profoundly revises our understanding of 
how the architecture of DNA repair and clonal proliferation can conspire 
to shape the cancer genome. 

Methods


Statistical methods were used to predetermine sample size for the test- 
ing of oncogenic selection by biased strand retention; otherwise, no 
statistical methods were used to determine sample size. The investiga- 
tors were blinded to allocation during histopathological assessment. 


Mouse colony management 

Animal experimentation was carried out in accordance with the Ani- 
mals (Scientific Procedures) Act 1986 (United Kingdom) and with the 
approval of the Cancer Research UK Cambridge Institute Animal Wel- 
fare and Ethical Review Body (AWERB): the maximum approved tumour 
burden was 10% body weight, which was not exceeded. Animals were 
maintained using standard husbandry: mice were group housed in Tecniplast
GM500 IVC cages with a 12 h:12 h light:dark cycle and ad libitum
access to water, food (LabDiet 5058), and environmental enrichments. 


Chemical model of hepatocarcinogenesis 

P15 male C3H and CAST mice were treated with a single intraperitoneal
(IP) injection of DEN (Sigma-Aldrich N0258; 20 mg per kg body weight)
diluted in 0.85% saline. Liver tumour samples were collected from 
DEN-treated mice 25 weeks (C3H) or 38 weeks (CAST) after treatment. 
All macroscopically identified tumours were isolated and processed 
in parallel for DNA extraction and histopathological examination. 
Non-tumour tissue from untreated P15 mice (ear, tail, and background 
liver) was sampled for control experiments. 


Tissue collection and processing 

Liver tumours of sufficient size (≥2 mm diameter) were bisected; one
half was flash frozen in liquid nitrogen and stored at -80 °C for DNA
extraction, and the other half was processed for histology. Tissue 
samples for histology were fixed in 10% neutral buffered formalin for 
24h, transferred to 70% ethanol, machine processed (Leica ASP300 
Tissue Processor; Leica), and paraffin embedded. All formalin-fixed 
paraffin-embedded sections were 3 µm in thickness.


Histochemical staining 

Formalin-fixed paraffin-embedded tissue sections were stained with 
haematoxylin and eosin (H&E) using standard laboratory techniques. 
Histochemical staining was performed using the automated Leica 
ST5020; mounting was performed on the Leica CV5030. 


Imaging 

Tissue sections were digitised using the Aperio XT system (Leica Biosys- 
tems) at 20x resolution; all H&E images are available in the BioStudies 
archive at EMBL-EBI under accession S-BSST383. 


Tumour histopathology 

H&E sections of liver tumours were blinded and assessed twice by a 
pathologist (S.J.A.); discordant results were reviewed by an independ- 
ent hepatobiliary pathologist (S.E.D.). Tumours were classified accord- 
ing to the International Harmonization of Nomenclature and Diagnostic 
Criteria (INHAND) guidelines for lesions in rats and mice. In addition,
tumour grade, size, morphological subtype, nature of steatosis and 
mitotic index were assessed (Supplementary Table 1), as well as the 
presence of cystic change, haemorrhage, necrosis, or vascular invasion. 


Sample selection for WGS 

Tumours which met the following histological criteria were selected 
for WGS (C3H n = 371, CAST n= 84): (i) diagnosis of either dysplastic 
nodule (DN) or hepatocellular carcinoma (HCC), (ii) homogenous 
tumour morphology, (iii) tumour cell percentage >70%, and (iv) ade- 
quate tissue for DNA extraction. Neoplasms with extensive necrosis, 
mixed tumour types, a nodule-in-nodule appearance (indicative of 
an HCC arising within a DN), or contamination by normal liver tissue 


were excluded. Since carcinogen-induced tumours arising in the same
liver are independent, multiple tumours were selected from each
mouse to minimise the number of animals used. A subset of normal 
(non-tumour) samples from untreated mice were also sequenced (C3H 
n=13, CAST n=7). 


Whole-genome sequencing 

Genomic DNA was isolated from liver tissue and liver tumours using 
the AllPrep 96 DNA/RNA Kit (Qiagen, 80311) according to the manu- 
facturer’s instructions. DNA quality was assessed on a 1% agarose gel 
and quantified using the Quant-IT dsDNA Broad Range Kit (Thermo 
Fisher Scientific). Genomic DNA was sheared using a Covaris LE220 
focused-ultrasonicator to a 450-bp mean insert size.

WGS libraries were generated from 1 µg of 50 ng per µl high molecular
weight genomic DNA using the TruSeq PCR-free Library Prep Kit (Illu- 
mina), according to the manufacturer’s instructions. Library fragment 
size was determined using a Caliper GX Touch with a HT DNA 1k/12K/ 
Hi Sensitivity LabChip and HT DNA Hi Sensitivity Reagent Kit to ensure 
fragments of 300-800 bp (target ~450 bp). 

Libraries were quantified by real-time PCR using the Kapa library 
quantification kit (Kapa Biosystems) on a Roche LightCycler 480. Libraries
(0.75 nM) were pooled in 6-plex and sequenced on a HiSeq X Ten
(Illumina) to produce paired-end 150-bp reads. Each pool of 6 libraries 
was sequenced over eight lanes (minimum of 40x coverage). 


Variant calling and somatic mutation filtering 

Sequencing reads were aligned to respective genome assemblies 
(C3H = C3H_HeJ_v1; CAST = CAST_EiJ_v1) with bwa-mem (v.0.7.12)
using default parameters. Reads were annotated to read groups using 
the Picard (v.1.124) tool AddOrReplaceReadGroups, and minor
annotation inconsistencies corrected using the Picard CleanSam and 
FixMateInformation tools. Bam files were merged as necessary, and 
duplicate reads were annotated using the Picard tool MarkDuplicates. 

Single-nucleotide variants were called using Strelka2 (v.2.8.4)
implementing default parameters. Initial variant annotation was performed
with the GATK (v.3.8.0) walker CalculateSNVMetrics
(https://github.com/crukci-bioinformatics/gatk-tools). Genotype calls with a
variant allele frequency <0.025 were removed. Although inbred strains 
were used, fixed genetic differences between the colonies and the refer- 
ence genome, as well as small numbers of germline variants segregating 
within the colonies were identified. For each strain, fixed differences 
identified as homozygous changes present in 100% of genotyped samples
were filtered out. Segregating variants were filtered based on the
excess clustering of mutations to animals with shared mothers. To 
generate a null expectation taking into account the family structure 
of the colonies, the parent-offspring relationships were randomly 
permuted 1,000 times. For each count of recurrent mutation (range 
5 to 371 inclusive), we determined the null distribution of expected 
distinct mothers. Comparing this to the observed count of distinct 
mothers for each recurrent (n > 4) mutation, those with a low probability
(P< 1x10, pnorm function from R (v.3.5.1)) under the null
were excluded from analyses. 

Copy number variation between tumours within strains was called 
using CNVkit (v.0.9.6). Non-tumour reference coverage was provided 
from non-tumour control WGS data (C3H n=11, CAST n=7) and per 
tumour cellularity estimates (see below) were provided. 


RNA sequencing 

Total RNA was extracted from P15 liver tissue (n =4 biological replicates 
per strain) using QIAZol Lysis Reagent (Qiagen), according to manu- 
facturer’s instructions. DNase treatment and removal were performed 
using the TURBO DNA-free kit (Ambion, Life Technologies), according 
to the manufacturer’s instructions. RNA concentration was measured
using a NanoDrop spectrophotometer (Thermo Fisher); RNA integrity 
was assessed on a Total RNA Nano Chip Bioanalyzer (Agilent). 




Total RNA (1 µg) was used to generate sequencing libraries using the
TruSeq Stranded Total RNA Library Prep Kit with Ribo-Zero Gold (Illu- 
mina), according to the manufacturer’s instructions. Library fragment 
size was determined using a 2100 Bioanalyzer (Agilent). Libraries were 
quantified by quantitative PCR (Kapa Biosystems). Pooled libraries 
were sequenced on a HiSeq4000 to produce ≥40 million paired-end
150-bp reads per library. 


RNA-seq data processing and analysis 

Transcript abundances were quantified with Kallisto (v.0.43.1) (using
the flag --bias) and a transcriptome index compiled from coding and
non-coding cDNA sequences defined in Ensembl v91. TPM estimates
were generated for each annotated transcript and summed across 
alternate transcripts of the same gene for gene-level analysis. The TSS 
for each gene was annotated with Ensembl v91 and based upon the most 
abundantly expressed transcript. RNA-sequencing (RNA-seq) data are 
available at Array Express at EMBL-EBI under accession E-MTAB-8518. 
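
The gene-level aggregation described above can be sketched as follows. The record layout (gene identifier, transcript identifier, TPM) and the identifiers in the example are hypothetical; the sketch only shows summing TPM across alternate transcripts and selecting the most abundant transcript, whose annotated start would then define the TSS.

```python
from collections import defaultdict


def gene_level_tpm(transcripts):
    """Return (gene -> summed TPM, gene -> most abundant transcript id)."""
    totals = defaultdict(float)
    top = {}
    for gene, transcript, tpm in transcripts:
        totals[gene] += tpm
        # Keep the transcript with the highest TPM seen so far for this gene.
        if gene not in top or tpm > top[gene][1]:
            top[gene] = (transcript, tpm)
    return dict(totals), {gene: tx for gene, (tx, _) in top.items()}


# Hypothetical quantification records.
records = [
    ("geneA", "txA1", 3.0),
    ("geneA", "txA2", 1.5),
    ("geneB", "txB1", 0.0),
]
gene_tpm, tss_transcript = gene_level_tpm(records)
```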

Genomic annotation data
Mouse liver proximity ligation sequencing (HiC) data were downloaded
from GEO (GSE65126), replicates were combined, then aligned to
GRCm38 and processed using the Juicebox (v.7.5) and Juicer scripts
to obtain the HiC matrix. Eigenvectors were obtained for 500-kb consecutive
genomic windows over each chromosome from the HiC matrix
using Juicebox and subsequently oriented (to distinguish compartment
A from B) using GC content per 500-kb bin. We used progressiveCactus
to project the 500-kb windows into the C3H reference genome
and Bedtools (v.2.28.0) to merge syntenic loci between 450 and 550
kb in size, removing the second instance where we observed overlaps.
Genic annotation was obtained from Ensembl v.91 for the corresponding
C3H and CAST reference genome assemblies (C3H_HeJ_v1,
CAST_EiJ_v1). Genomic repeat elements were annotated using Repeat- 
Masker (v.20170127; http://www.repeatmasker.org) with the default 
parameters and libraries for mouse annotation. 


The analysable fraction of the genome 

Analysis and sequence composition calculations were confined to 
the main chromosome assemblies of the reference genome (chromo- 
somes 1-19 and X). Using WGS of non-tumour liver, ear and tail samples 
(C3H n=11, CAST n=7) collected and sequenced contemporaneously
with tumour samples, genome sequencing coverage was calculated 
for 1-kb windows using multicov in Bedtools (v.2.28.0). Windows
with read coverage >2 s.d. from the autosomal mean were flagged as 
suspect in each tumour. Read coverage over the X chromosome was 
doubled in these calculations to account for the expected hemizygosity 
in these male mice. Any 1-kb window identified as suspect in >90% of 
these non-tumour samples was flagged as ‘abnormal read coverage’ 
(ARC) and masked from subsequent analysis. This masked 12.7% of 
the C3H and 11.5% of the CAST reference genomes yielding analys- 
able haploid genome sizes of C3H = 2,333,783,789 nucleotides (nt) and 
CAST = 2,331,370,397 nt. 
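
The masking rule can be sketched in a few lines. The version below assumes that coverage is already summarised per 1-kb window for each non-tumour sample, that the standard deviation is the population s.d. of autosomal windows within a sample, and that a window is masked when it is suspect in strictly more than 90% of samples. Function and parameter names are illustrative only.

```python
from statistics import mean, pstdev


def suspect_windows(coverage, is_x, n_sd=2.0):
    """Flag windows whose coverage is more than n_sd s.d. from the autosomal mean."""
    # X-chromosome coverage is doubled to allow for hemizygosity in male mice.
    adjusted = [2 * c if x else c for c, x in zip(coverage, is_x)]
    autosomal = [c for c, x in zip(coverage, is_x) if not x]
    mu, sd = mean(autosomal), pstdev(autosomal)
    return [abs(c - mu) > n_sd * sd for c in adjusted]


def arc_mask(samples, is_x, min_fraction=0.9):
    """Mask windows flagged as suspect in more than min_fraction of samples."""
    flags = [suspect_windows(coverage, is_x) for coverage in samples]
    n = len(samples)
    return [sum(column) / n > min_fraction for column in zip(*flags)]


# Toy data: ten autosomal windows and one X window per sample.
is_x = [False] * 10 + [True]
outlier = [30, 30, 30, 300, 30, 30, 30, 30, 30, 30, 15]
flat = [30] * 10 + [15]
```

With three copies of the `outlier` sample only the fourth window is masked; if one of the three samples is `flat`, that window is flagged in two of three samples and is kept.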


Mutation rate calculations 

Mutation rates were calculated as 192 category vectors representing 
every possible single-nucleotide substitution conditioned on the identity
of the upstream and downstream nucleotides. Each rate is
the observed count of a mutation category divided by the count of
the trinucleotide context in the analysed sequence. To report a single
aggregate mutation rate, the three rates for each trinucleotide context
were summed to give a 64 category vector and the weighted mean
of that vector reported as the mutation rate. The vector of weights
is the trinucleotide sequence frequency of a reference sequence,
for example the composition of the whole genome. In the case of
whole-genome analysis, the same trinucleotide counts are used in (1) 
the individual category rates calculation and (2) the weighted mean of 


the rates, cancelling out. For windowed comparisons of mutation rates, 
the weighted mean is calculated using the genome wide composition of 
trinucleotides rather than the local sequence composition, providing 
a compositionally adjusted mutation rate estimate. For mutation rates
in TCR analysis, the same compositional adjustment was carried out 
but using the trinucleotide composition of the aggregate genic spans 
of genome (minus ARC regions) for normalization. 
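
A minimal sketch of the aggregate rate follows. It assumes mutation counts keyed by (trinucleotide context, alternate base), context counts for the analysed sequence, and context counts for the reference composition used as weights; these data structures and the function name are illustrative assumptions rather than the authors' implementation.

```python
def aggregate_mutation_rate(mutation_counts, context_counts, reference_counts):
    """Composition-adjusted mutation rate from per-category counts.

    mutation_counts:  {(context, alt_base): observed mutations}
    context_counts:   {context: occurrences in the analysed sequence}
    reference_counts: {context: occurrences in the reference composition}
    """
    per_context = {}
    for (context, _alt), n in mutation_counts.items():
        # Sum the (up to three) substitution rates that share a context.
        per_context[context] = per_context.get(context, 0.0) + n / context_counts[context]
    total_weight = sum(reference_counts.values())
    weighted = sum(per_context.get(context, 0.0) * weight
                   for context, weight in reference_counts.items())
    return weighted / total_weight


# Toy example with two contexts only.
mutations = {("ATA", "A"): 5, ("ATA", "C"): 5, ("CTG", "A"): 6}
window = {"ATA": 100, "CTG": 300}
genome = {"ATA": 300, "CTG": 100}
```

When the weights equal the analysed composition (`window` used for both), the result reduces to total mutations over total sites (16/400 = 0.04), which is the cancelling-out noted above; weighting by `genome` instead gives 0.08.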


Mutation signatures 

The 96 category ‘folded’ mutation counts for each of the 371 C3H 
tumours were deconvolved into the best fitting number (K) of component
signatures using sigFit (v.2.0) with 1,000 iterations and K set
to integers 2 to 8 inclusive. A heuristic goodness-of-fit score based on
cosine similarity favoured instances where K = 2. The DEN1 and DEN2 
signatures reported were obtained by running sigFit with 30,000 itera- 
tions for K=2. Analysis of CAST tumours gave less distinct separation 
of signatures so the C3H derived DEN1 and DEN2 were used for both 
strains. To fit signatures to each tumour we used sigFit provided with 
the DEN signatures and additional SPONT1 and SPONT2 signatures 
that were derived from equivalent WGS analysis of spontaneous 
(non-DEN-induced) C3H tumours. 


Driver mutation identification 

Candidate cancer driver genes were identified by applying OncodriveFML
(v.2.2.0 using the SIFT scoring scheme) and OncodriveCLUSTL
(v.1.1.1) to mutations identified in C3H tumours. The only
genes convincingly identified as significantly enriched for function- 
ally impactful or clustered mutations were Braf, Egfr and Hras. Kras 
appeared as marginally significant. These four genes were identified 
for C3H®. Protein altering mutations in those genes were annotated as 
driver mutations in C3H and CAST tumours. 


Mutational asymmetry segmentation and scoring 

For each tumour a focal subset of ‘informative’ mutation types were 
defined, T>N or A>N mutations, in the case of DEN-induced tumours. 
The order of focal mutations along each chromosome was represented 
as a binary vector (for example, 0 for T>N, 1 for A>N). Vectors corre- 
sponding to each chromosome of each tumour were processed with 
the cpt.mean function of the R Changepoint (v.2.2.2) package run
with an Akaike information criterion (AIC) penalty function, maximum 
number of changepoints set to 12 (Q=12), and implementing the PELT 
algorithm for optimal changepoint detection. Following segmenta- 
tion, the defined segments were scored for strand asymmetry, taking 
into account the sequence composition of the segment. For example 
in tumours with T>N or A>N informative mutations the number of 
Ts on the forward strand is the count of Watson sites $G_W$ and the number
of T>N mutations is $\mu_W$, which together give the Watson strand
rate $R_W = \mu_W / G_W$. The forward strand count of As and mutations from
A likewise give the Crick strand rate $R_C = \mu_C / G_C$. From these two rates
we calculate a relative difference metric, the mutational asymmetry
score $S = (R_W - R_C) / (R_W + R_C)$.

The parameter S scales from 1 for all Watson (for example, DEN T>N
mutations) through 0 (50:50 T>N:A>N) to -1 for all Crick (for example,
DEN A>N mutations). For the categorical assignment, S > 0.3 is Watson-strand
asymmetric, S < -0.3 Crick-strand asymmetric and in the range -0.3 <
S < 0.3 symmetric, though more stringent filtering was applied where
noted. Segments containing <20 informative mutations were discarded 
from subsequent analyses. 
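
The scoring and categorical assignment described above can be sketched as follows. This is a minimal illustration rather than the study's code: the function and argument names are hypothetical, and the inputs are assumed to be per-segment counts of mutations and of mutable sites on each strand.

```python
def asymmetry_score(mu_w, g_w, mu_c, g_c):
    """Mutational asymmetry score S = (R_W - R_C) / (R_W + R_C).

    mu_w, mu_c: informative mutation counts (e.g. T>N and A>N).
    g_w, g_c:   forward-strand counts of the mutable base (e.g. T and A).
    """
    r_w = mu_w / g_w  # Watson strand rate
    r_c = mu_c / g_c  # Crick strand rate
    if r_w + r_c == 0:
        raise ValueError("segment has no informative mutations")
    return (r_w - r_c) / (r_w + r_c)


def classify_segment(mu_w, g_w, mu_c, g_c, threshold=0.3, min_informative=20):
    """Return 'watson', 'crick', 'symmetric', or None for a discarded segment."""
    if mu_w + mu_c < min_informative:
        return None  # too few informative mutations to score
    s = asymmetry_score(mu_w, g_w, mu_c, g_c)
    if s > threshold:
        return "watson"
    if s < -threshold:
        return "crick"
    return "symmetric"
```

For example, a segment with 30 T>N and 10 A>N mutations over equal numbers of T and A sites scores S = 0.5 and would be called Watson-strand asymmetric.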

To test for oncogenic selection at sites with recurrent mutations,
mutational asymmetry segments overlapping the focal mutation were
categorised based on their asymmetry score S, as above. The test was
implemented as a Fisher’s exact test with the 2 × 2 contingency table
comprising the counts of chromosomes (two autosomes per cell)
stratified by Watson-versus-Crick asymmetry and the presence of the focal
mutation in the tumour. Tumours containing another known driver
gene or recurrent mutation within the focal asymmetry segment were
discarded from the analysis. We estimated the minimum recurrence of
a mutation necessary to reliably detect oncogenic selection through
simulation. Biased segregation of chromosomes containing drivers was
modelled using the observed median excess of T>N over A>N lesions
(23-fold), and random segregation of non-driver-containing strands (1:1
ratio). Our model predicted that >33 C3H recurrences or >41 CAST
recurrences would give 80% power to detect oncogenic selection if present.
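
As an illustration of the 2 × 2 test, the sketch below computes a two-sided Fisher's exact P-value from first principles. It is a generic implementation under stated assumptions, not the analysis code: the function name is hypothetical, and it returns the simple sample odds ratio (ad/bc), whereas R's fisher.test reports a conditional maximum-likelihood estimate.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Rows could be, for example, Watson versus Crick asymmetry and columns
    tumours with versus without the focal mutation.
    Returns (sample odds ratio, p-value).
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of k in the top-left cell.
        return comb(row1, k) * comb(row2, col1 - k) / total

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum all tables at least as extreme (no more probable) than the observed one.
    p_value = sum(
        prob(k)
        for k in range(lowest, highest + 1)
        if prob(k) <= p_observed * (1 + 1e-7)
    )
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return odds_ratio, min(p_value, 1.0)
```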


Tumour cellularity estimates 

We calculated tumour cellularity as a function of the non-reference
read count in autosomal chromosomes, (1 − R/d) × 2, where R is the
reference read count at a mutated site and d is the total read depth
at the site. For each tumour these values were binned in percentiles
and the midpoint of the most populated (modal) percentile taken as
the estimated cellularity of the tumour. Given the low rate of copy
number variation across the DEN-induced tumours, no correction was
made for copy-number distortion. Skew in the variant allele frequency
(VAF = 1 − R/d) distribution was calculated using Pearson’s median
skewness coefficient implemented in R as (3 × (mean − median))/s.d.
of the VAF distribution.
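
A rough sketch of both calculations is given below. The names are hypothetical, and the binning is an assumption: "percentiles" is interpreted here as bins of width 0.01, with ties between equally populated bins resolved towards the lower bin.

```python
import math
import statistics
from collections import Counter


def estimate_cellularity(sites):
    """Modal-bin cellularity estimate.

    sites: iterable of (reference_read_count, total_depth) at mutated
    autosomal sites. Assumes bins of width 0.01 (an interpretation of
    'percentiles', not confirmed by the source).
    """
    bins = Counter()
    for ref_reads, depth in sites:
        value = (1 - ref_reads / depth) * 2
        # Small epsilon guards against floating-point error at bin edges.
        bins[math.floor(value * 100 + 1e-9)] += 1
    modal_bin, _ = max(bins.items(), key=lambda item: (item[1], -item[0]))
    return (modal_bin + 0.5) / 100  # midpoint of the modal bin


def median_skewness(vafs):
    """Pearson's median skewness: 3 * (mean - median) / s.d."""
    return 3 * (statistics.mean(vafs) - statistics.median(vafs)) / statistics.stdev(vafs)
```

The sample standard deviation (n − 1 denominator) is used to mirror the behaviour of R's sd function.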


Identifying and filtering reference genome misassemblies 

Because the mutation asymmetry patterns produced by lesion segregation
allow the long-range phasing of chromosome strands, they can detect
discrepancies in sequence order and orientation between the sequenced
genomes and the reference. We identified autosomal asymmetry
segments that immediately transitioned from Watson bias (S > 0.3) to
Crick (S < -0.3) or vice versa without occupying the intermediate
unbiased state (-0.3 < S < 0.3); such discordant segments are unexpected.
Allowing for ±100 kb uncertainty in the position of each exchange site
we produced the discordant segment coverage metric. At sites with
discordant segment coverage >1 we calculated percentage consensus
for misassembly M = ds/(ds + cs), where ds is the number of discordant
segments over the exchange site and cs is the number of concordant
segments: those where either Watson or Crick mutational asymmetry
extends at least $1 \times 10^{6}$ nucleotides on both sides of the
exchange site. The approximate genomic coordinates for a C3H
strain-specific inversion on chromosome 6 have been previously reported.


SCE-site analysis 

Identified SCE sites were aggregated across tumours from each strain. 
Exchange sites within $1 \times 10^{6}$ nt of known and proposed reference
genome misassembly sites were excluded from analysis. The mid-point 
between the flanking informative mutations was taken as the reference 
genome position of the exchange event, and the distance between those 
flanking mutations as the positional uncertainty of the estimate. To 
generate null expectations for mutation rate measures, the coordinate 
of an exchange was projected into the genome of a proxy tumour and 
the mutation rates and patterns measured from that proxy tumour 
(repeated 100 times). The permutation of tumour identifiers for the 
selection of proxy tumours was a shuffle without replacement that 
preserved the total number of exchange sites measured in each tumour. 

The comparison of mutation spectra between windows was calcu- 
lated as the cosine distance between the 96 category trinucleotide 
context mutation spectra for the whole genome and that calculated for 
the aggregated 5-kb window. The 96 categories were equally weighted 
for this comparison. 
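
The spectrum comparison can be illustrated with a short sketch. The function name is hypothetical; the inputs are assumed to be two equally ordered vectors of per-category values (96 entries for trinucleotide-context spectra), with every category weighted equally.

```python
import math


def cosine_distance(spectrum_a, spectrum_b):
    """Cosine distance (1 - cosine similarity) between two mutation spectra."""
    if len(spectrum_a) != len(spectrum_b):
        raise ValueError("spectra must have the same number of categories")
    dot = sum(a * b for a, b in zip(spectrum_a, spectrum_b))
    norm_a = math.sqrt(sum(a * a for a in spectrum_a))
    norm_b = math.sqrt(sum(b * b for b in spectrum_b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("a spectrum with no mutations has no direction")
    return 1 - dot / (norm_a * norm_b)
```

Because the measure depends only on the direction of each vector, a window with an elevated mutation count but an unchanged spectrum still gives a distance near zero.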

Exchange site enrichment analysis used Bedtools shuffle to permute
the genomic positions of exchange sites into the analysable fraction of
the genome (defined above). Observed rates of annotation overlap were
compared to the distribution of values from 1,000 permuted exchange
sites. For genic overlaps we used Ensembl v.91 coordinates for genic
spans; gene expression status was based on the summed expression
over all annotated transcripts for the gene from P15 liver from the
matched mouse strain. Expression thresholds were defined as >50th
centile for active and <50th centile for inactive genes.

A higher count of informative mutations provides greater power 
to identify shorter mutational asymmetry segments. To fairly test for
correlation between nucleotide substitution rate and SCE rate we ran- 
domly down-sampled informative mutations to 10,000 per tumour 
genome and recomputed the mutational asymmetry segmentation 
patterns from the sampled data. Tumours with <10,000 informative 
mutations were excluded. We then correlated the total (not downsam- 
pled) nucleotide substitution load to the count of SCE events inferred 
from the down-sampled data. 


TCR calculations 

For each protein-coding gene, the maximally expressed transcript
isoform was identified from P15 liver in the matched strain (TPM
expression); these are subsequently referred to as the primary
transcripts. In the case of ties, transcript selection was arbitrary.
Genes were partitioned into five categories based on the expression of
the primary transcript: expression level 0 (<0.0001 TPM) and four
quartiles of detected expression.

Using the segmental asymmetry patterns of each tumour and the 
annotated coordinates (Ensembl v.91) of the selected transcripts, we 
identified transcripts completely contained in a single Watson or Crick 
asymmetric segment and located at least 200 kb from the segment 
boundary at both ends. We also applied strict asymmetry criteria of 
mutational asymmetry scores S > 0.8 for Watson and S < -0.8 for Crick 
asymmetry segments, though analysis with the standard asymmetry 
thresholds and no segment boundary margin gives similar results and
identical conclusions. For each transcript in each tumour we then used
both the transcriptional orientation of the gene and the mutational
asymmetry of the segment containing it to resolve the segregated lesions to
either the template (anti-sense) or non-template (sense) strand of the 
gene. Transcripts contained in mutationally symmetric regions or not 
meeting the strict filtering criteria were excluded from analysis. 
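
The strand resolution step can be sketched as below. This is an illustration under an explicit assumption: a Watson-asymmetric segment is taken to carry its lesions on the forward (reference) strand and a Crick-asymmetric segment on the reverse strand. The function name and string labels are hypothetical.

```python
def lesion_strand(gene_strand, segment_asymmetry):
    """Resolve lesions in a gene to the template or non-template strand.

    gene_strand:       '+' if the gene is transcribed along the forward
                       strand, '-' otherwise.
    segment_asymmetry: 'watson' or 'crick' for the containing segment.
    Assumes Watson asymmetry means lesions lie on the forward strand.
    """
    if gene_strand not in ("+", "-"):
        raise ValueError("gene_strand must be '+' or '-'")
    if segment_asymmetry not in ("watson", "crick"):
        raise ValueError("symmetric segments are excluded from analysis")
    lesions_on_forward = segment_asymmetry == "watson"
    sense_is_forward = gene_strand == "+"
    # Lesions on the same strand as the sense sequence are non-template.
    return "non-template" if lesions_on_forward == sense_is_forward else "template"
```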

We then analysed mutation rates stratifying by gene expression level 
and the template/non-template strand of the lesions but aggregating 
between tumours within the same strain. The TSS coordinates used 
correspond to the annotated 5’ end of the primary transcripts. 


Multiallelic variation 

Aligned reads spanning genomic positions of somatic mutations were 
re-genotyped using Samtools mpileup (v.1.9). Genotypes supported 
by >2 reads with a nucleotide quality score of >20 were reported, con- 
sidering sites with two alleles as biallelic, those with three or four alleles 
as multiallelic. The fraction of called mutations exhibiting multiallelic 
variation was calculated for the analysable fraction of the genome, 
across consecutive 10-Mb windows and also for each of the mutational
asymmetry segments calculated for each tumour. 
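
A minimal sketch of the per-site allele classification follows. It does not parse Samtools mpileup output; the input is assumed to be a list of (base, quality) pairs already extracted for one site, and the function name and default thresholds (taken literally from the text as strict inequalities) are illustrative.

```python
from collections import Counter


def classify_site(observations, min_reads=2, min_quality=20):
    """Classify one mutated site from its read-level base calls.

    observations: iterable of (base, nucleotide_quality) pairs.
    An allele is retained when more than `min_reads` reads with quality
    above `min_quality` support it.
    Returns 'biallelic', 'multiallelic', or None if fewer than two
    alleles pass the filters.
    """
    counts = Counter(base for base, quality in observations if quality > min_quality)
    alleles = [base for base, n in counts.items() if n > min_reads]
    if len(alleles) >= 3:
        return "multiallelic"
    if len(alleles) == 2:
        return "biallelic"
    return None
```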

A null expectation for the multiallelic rate estimate was generated per
C3H tumour; genomic positions identified as mutated across the other 
370 tumours were down-sampled to match the mutation count in the 
focal tumour. Any of these proxy mutation sites with a non-reference 
genotype supported by > 2 reads and nucleotide quality score > 20 at 
the focal site were referred to as ‘multiallelic’ for the purposes of defin- 
ing a background expectation for the calling of multiallelic variation. 
For each tumour, this was repeated 100 times and the mean reported. 

We used WES of 15 C3H tumours from prior work that have
subsequently been used to generate WGS data in this study as a basis for
validating multiallelic calls. Multiallelic variant positions derived from
WGS were genotyped in WES using Samtools mpileup, as described
above. Only sites with > 30x WES coverage were considered and alleles
were found to be concordant if a WGS genotype was supported by >1
read in the WES data. To provide a null expectation, the analysis was
repeated using WES data from a different tumour and validation rates
reported for all versus all combinations of mismatched WGS-WES
pairs (n = 15² − 15 = 210).

To quantify combinatorial genetic diversity for each tumour, pairs
of mutations located between 3 and 150 nt apart were phased using 
sequencing reads that traversed both mutation sites. Distinct allelic 
combinations were counted after extraction with Samtools mpileup 
using only reads with nucleotide quality score > 20 over both muta- 
tion sites. 


Estimating the cell generation of transformation 

Knowing the fraction of lesion segregation segments that generated 
multiallelic variation across a tumour genome allows the inference 
of the generation time post-mutagenesis of the cell from which the 
tumour developed, because each successive cell generation is expected 
to retain only 50% of the lesion containing segments. We estimate this 
fraction as follows. Let p denote the fraction of multiallelic segments
and let q be its complement, that is, the fraction of non-multiallelic
segments, for each tumour genome, with segment boundaries being SCE
sites or chromosome boundaries. In order to determine p, we re-purpose
the quadratic Hardy-Weinberg equation: $(p + q)^2 = p^2 + 2pq + q^2 = 1$,
which holds since the two possible fractions need to sum to unity. Given an
asymmetric segment of interest in the diploid genome, there are three
distinct scenarios: (i) both chromosomes are multiallelic ($p^2$), (ii) one
of the chromosomes is multiallelic and the other is not ($pq + qp$) and
(iii) both chromosomes are non-multiallelic ($q^2$). The first two scenarios
are not distinguishable from the data as both appear multiallelic (m).
However, in the third scenario, for a segment to be non-multiallelic
(biallelic, b), both chromosomal copies have to be non-multiallelic.
As described below, $q^2$ can be estimated directly from the data and is
subsequently used to estimate $p = 1 - \sqrt{q^2}$ and hence the cell
generation number of transformation post-mutagenesis.

The estimation of $q^2$ requires computing the ratio $q^2 = b/(b + m)$. We
can directly observe the counts of b as non-multiallelic segments. The
number of autosomal chromosome pairs (n = 19) and count of SCE
events (x) give the total number of segments in the genome, $b + m = n + x$.
Exchange events are not expected to align between allelic chromosomes,
which will result in the partial overlap of segments between
allelic copies. Although this increases the number of observed segments
(b and m) relative to actual segments, assuming the independent
behaviour of allelic chromosomes and that segment length is independent
of multiallelic state, this partial overlap does not systematically
distort the quantification of b or the estimation of $q^2$.
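
The arithmetic above can be sketched in a few lines. The first function follows the stated relations directly. The second is an assumption that is not spelled out in the text: it maps p to a generation number g by taking p = 0.5^(g − 1), which follows from the expectation that each successive generation retains 50% of lesion-containing segments. Function names are hypothetical.

```python
import math


def multiallelic_fraction(b, sce_events, chromosome_pairs=19):
    """Estimate p = 1 - sqrt(q^2), with q^2 = b / (n + x).

    b:          number of non-multiallelic (biallelic) segments observed.
    sce_events: number of SCE events x detected in the tumour.
    """
    total_segments = chromosome_pairs + sce_events  # b + m = n + x
    q_squared = b / total_segments
    return 1 - math.sqrt(q_squared)


def generation_from_fraction(p):
    """Assumed mapping p = 0.5 ** (g - 1), so g = 1 - log2(p)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return 1 - math.log2(p)
```

For instance, a tumour with 21 SCE events and 10 biallelic segments gives q² = 0.25 and p = 0.5, which under this mapping corresponds to transformation in the second generation.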

To call a non-multiallelic segment (b) we require less than 4%
multiallelic sites. The threshold is based on the tri-modal frequency
distribution of multiallelic rates per segment, aggregated over all 371
C3H tumours. The 4% threshold separates the lower distribution of 
multiallelic rates from the mid and higher distributions. 

To test for the enrichment of specific driver gene mutations in early 
generation versus late generation transformation post-DEN treatment, 
we applied Fisher’s exact test (fisher.test function in R) to compare 
the generation 1 ratio of tumours with, versus those without a focal 
mutation, to the same ratio for tumours inferred to have transformed 
in a later generation. We additionally report the same odds ratios, but
requiring that the “with focal mutation” tumours had a driver mutation 
in only one of the driver genes: Hras, Braf, or Egfr. 


Cell-line and human cancer mutation analysis 

Somatic mutation calls were obtained from DNA maintenance and
repair pathway perturbed human cells. Of the 128,054 reported single
nucleotide variants, 6,587 unique mutations (genomic site and specific
change) were shared between two or more sister clones, so probably
represent mutations present but not detected in the parental clone. All
occurrences of the shared mutations were filtered out leaving 106,688
mutations for analysis, although the inclusion of these filtered
mutations does not alter any conclusions drawn. Somatic mutation calls
from mutagen-exposed cells were also obtained; no additional filtering
was applied to these sub-clone mutations.


Somatic mutation calls from the International Cancer Genome
Consortium (ICGC) were obtained as simple_somatic_mutation.open.*
files from release 28 of the consortium, one file for each project. These
somatic mutations have been called from a mixture of WGS and WES.
Of the 18,965 patients represented (and not embargoed in the release
28 data set), 116 were excluded from analysis; these represent a distinct
WES subset of the LICA-CN project that appears to show a processing
artefact in the distribution of specific mutation subsets. ICGC
mutations were filtered to remove insertion and deletion mutations and also
filtered for redundancy so that each mutation was only reported once
for each patient. Mutation signature deconvolution was performed
using the R MutationalPatterns (v.1.4.2) package and COSMIC signature
22 was interpreted as aristolochic acid.


The rl₂₀ metric and runs tests

Amongst only the informative mutations (for example, T>N/A>N in
DEN) three consecutive T>N without an intervening A>N is a run of
three. The R function rle was used to encode the run-lengths for binary
vectors of informative mutations along the genome of a focal tumour.
Ranking them from the longest to the shortest run, we find the set of
longest runs that encompass 20% of all informative mutations in the
tumour. The run-length of the shortest of those is reported as the rl₂₀
metric. The threshold percent of mutations was defined as having to
be less than 50%, as on average only 50% of the autosomal genomes are
expected to show mutational asymmetry patterns. On testing with
randomized data, the value of 20% gave a stable null expectation (maximum
observed value of a run of five when simulating 10,000 informative
mutations) and still encompassed a large fraction of the informative
mutations. All rl₂₀ results reported were implemented so that runs
were broken when crossing chromosome boundaries. To define an
empirical significance threshold for genomes with fewer mutations,
we simulated 1,000 random informative mutations 100,000 times;
>99.995% of simulations had rl₂₀ <5 and 100% had rl₂₀ < 6.
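
The metric can be sketched in Python as follows, with itertools.groupby standing in for R's rle. The function name and input layout (one binary vector per chromosome, so that runs never cross chromosome boundaries) are illustrative choices, not the study's implementation.

```python
from itertools import groupby


def rl20(chromosome_vectors, fraction=0.2):
    """Run-length metric analogous to an N50, computed over mutation runs.

    chromosome_vectors: iterable of binary sequences (e.g. 0 for T>N,
    1 for A>N), one per chromosome, ordered by genomic position.
    Returns the length of the shortest run in the set of longest runs
    that together cover `fraction` of all informative mutations.
    """
    runs = []
    for vector in chromosome_vectors:
        # Run-length encode each chromosome separately.
        runs.extend(len(list(group)) for _, group in groupby(vector))
    total = sum(runs)
    if total == 0:
        raise ValueError("no informative mutations supplied")
    target = fraction * total
    covered = 0
    for run_length in sorted(runs, reverse=True):
        covered += run_length
        if covered >= target:
            return run_length
```

With 20 informative mutations arranged as a run of three, a run of two and 15 singletons, the two longest runs are needed to cover 20% of the mutations, so the reported value is 2.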

The Wald-Wolfowitz runs test was performed using the runs.test
function of the R randtests (v.1.0) library. It was applied to binary
vectors of informative changes as described above, with threshold = 0.5.
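
For orientation, the textbook form of the test with a two-sided normal approximation is sketched below. This is a generic implementation and is not claimed to reproduce the randtests code path; the function name is hypothetical.

```python
import math
from itertools import groupby


def runs_test(binary_vector):
    """Wald-Wolfowitz runs test with a two-sided normal approximation.

    Returns (z, p_value). A negative z indicates fewer, longer runs than
    expected under random ordering, as produced by strand asymmetry.
    """
    n1 = sum(1 for x in binary_vector if x)
    n2 = len(binary_vector) - n1
    if n1 == 0 or n2 == 0:
        raise ValueError("both categories must be present")
    runs = sum(1 for _ in groupby(bool(x) for x in binary_vector))
    n = n1 + n2
    expected = 2 * n1 * n2 / n + 1
    variance = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n * n * (n - 1))
    if variance == 0:
        raise ValueError("too few observations for a normal approximation")
    z = (runs - expected) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```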

The Wald-Wolfowitz runs test significance is inflated by coordinated
dinucleotide changes, such as those produced by UV light exposure,
and also by other local mutational asymmetries such as replication
asymmetry and kataegis events. The rl₂₀ metric appears robust to most
such distortions but we find it efficiently detects kataegis events that
are in an otherwise mutationally quiet background, as is often the case
for breast cancer. For this reason we also indicate the total genomic
span of mutations in the rl₂₀ subset of mutation runs: kataegis events
typically span a tiny (<5%) fraction of the whole genome.


Key resources 

The key reagents and resources required to replicate our study are 
listed in Supplementary Table 3. For externally sourced data, where 
applicable, URLs that we used can be found in the Git repository
https://git.ecdf.ed.ac.uk/taylor-lab/Ice-Is.

Primary data processing was performed in shell-scripted environ- 
ments calling the software indicated. Except where otherwise noted, 
analysis processing post-variant calling was performed in a Conda 
environment and choreographed with Snakemake running in an LSF 
batch control system (Supplementary Table 3). 


Reporting summary 
Further information on research design is available in the Nature 
Research Reporting Summary linked to this paper. 


Data availability 


The WGS FASTQ files are available from the European Nucleotide
Archive (ENA) under accession number PRJEB37808. RNA-seq files are
available from ArrayExpress under experiment number E-MTAB-8518.
Digitised histology images are available from BioStudies under
accession S-BSST383.


Code availability 


The analysis pipeline including Conda and Snakemake configuration 
files can be obtained without restriction from the repository
https://git.ecdf.ed.ac.uk/taylor-lab/Ice-ls.

Extended Data Fig. 1 | Summary mutation metrics for C3H and CAST
tumours. a, Single nucleotide substitution rates per C3H tumour, rank ordered
over x-axis (grey points, median blue line). Insertion/deletion (indel, <11 nt)
rates shown in black. b, Y-axis from a, expanded to show distribution of indel
rates with preserved tumour order. c, Number of C3H copy number variant
(CNV) segments and their total span as a percent of the haploid genome. Blue
shading shows intensity of overlapping points as a percent of all tumours in the
plot. d-f, Corresponding plots for CAST derived tumours; f, two extreme x-axis
outliers relocated (red) and x-axis value shown. g, h, Mutation spectra
deconvolved from the aggregate spectra of 371 C3H tumours, subsequently
referred to as the DEN1 and DEN2 signatures. DEN1 is dominated by T>N or their
complement A>N changes thought to arise from the O⁴-ethyl-deoxythymidine
adduct of T. DEN2 substitutions are primarily C>T or their complement G>A
changes likely from O⁶-ethyl-2-deoxyguanosine lesions of G. i, Oncoplot
summarizing mutation load, mutation signature composition, and driver gene
mutation complement of C3H tumours. j, Oncoplot of CAST derived tumours
as in i. The DEN2 signature is a minor component of most tumours but
prominent in a minority (i, j).

Extended Data Fig. 2 | Mutational asymmetry across 50% of the autosomal
genome and 100% of the haploid X chromosomes. a, b, Typically 50% of the
autosomal genomic span (percent of nucleotides) in tumours is contained in
segments with either Watson or Crick strand mutational asymmetry. a, C3H
tumours, n=371. b, CAST tumours, n=84. c, d, Typically 100% of the haploid
X chromosome shows Watson or Crick strand mutational asymmetry.
c, C3H tumours (n=371). d, CAST tumours (n=84).

Extended Data Fig. 3 | The frequency of SCEs correlates with mutation rate,
and localizing reference genome assembly errors. a, The relationship
between single nucleotide substitution mutation load and detected SCE events
in C3H tumours. DEN is known to produce ethyl adducts on the sugar-phosphate
backbone of DNA as well as mutation-inducing modifications to the bases,
which could lead to strand breaks triggering SCE. The frequent observation
and correlation between rates of SCE and point mutation supports this view.
Counts of SCE (y-axis) are based on down-sampling to 10,000 informative
mutations per tumour to ensure equal power to detect SCE in each tumour.
Tumours with <50% cellularity (pink) have high mutation load and form a
sub-group with few detected SCE events; these are suspected to be polyclonal
tumours and were excluded from the Pearson’s correlation reported (n=335
independent tumour samples, implemented in a two-sided test, significance
from Fisher’s transform). b, As for a, but showing CAST derived tumours (n=84,
after cellularity exclusions n=77). c, Evaluation of the relationship between
mutation load and ability to detect SCE events. Mutations from C3H tumour
94315_N8 (shown in Fig. 2) randomly down-sampled and segmentation analysis
applied. The y-axis shows the percentage of SCE events detected (100
replicates, mean red, 95% C.I. pink). The x-axis is on a log-scale: 95% of C3H and
>95% of CAST tumours have mutation counts to the right of the blue vertical
line. Down-sampling other tumours gave comparable results. d, The same
down-sampling data as shown in c but the y-axis shows the percent of
mutations with the correct (same as full data) mutational asymmetry
assignment (mean red, 95% C.I. pink). e, Candidate C3H reference genome 
assembly errors. Genome coordinates shown on the x-axis. Immediate 
switches between Watson and Crick asymmetry are not expected on 
autosomes unless both copies of the chromosome have a SCE event at 
equivalent sites. However, inversions and translocations between the 
sequenced genomes and the reference assembly are expected to produce 
immediate asymmetry switches. The discordant segment coverage count 
(black y-axis) shows the number of informative tumours (those with either 
Watson or Crick strand asymmetry at the corresponding genome position) that 
suggest a tumour genome to reference genome discrepancy. Consensus
support (brown y-axis) plotted as triangles shows the percentage of
informative tumours that support a genomic discrepancy at the indicated
position (only shown for values >50% support). The two sites on chromosome 6
in C3H correspond to a previously identified C3H strain-specific inversion that
is known to be incorrectly oriented in the C3H reference assembly.
f, Candidate CAST reference genome assembly errors, plotted as per e. The 
candidate misassembly on chromosome 14 in both strains occurs at an 
approximately orthologous position, suggesting a rearrangement shared 
between strains or a misassembly in the BL6 GRCm38 reference assembly 
against which other mouse reference genome assemblies have been 
scaffolded. 

Extended Data Fig. 4| Locally elevated mutation load is driven by SCE. 

a, Double strand breaks (DSBs) and other DNA damage can trigger 
homologous-recombination-mediated DNA repair between sister chromatids. 
The repair intermediate resolves into separate chromatids through cleavage 
and ligation; grey triangles denote cleavage sites for one of the possible 
resolutions that would result in a large-scale SCE event. Although illustrated for
double-ended DNA breaks, single ended breaks from collapsed replication 
forks can be repaired through homologous recombination and could similarly 
lead to the formation of repair intermediate structures that can be resolved as 
SCEs. b, Enrichment analysis of SCE sites (red) compared with null expectations 
from randomly permuting locations into the analysable fraction of the genome 
(grey distributions), the black boxes denote 95% of 1,000 permutations. SCE 
events are enriched in later replicating and transcriptionally less active 
genomic regions (Hi-C defined compartment B), and correspondingly 
depleted from early replicating active regions. c, Aggregating across n=9,645
SCE sites, the observed mutation rate approximately doubles at the inferred 
site of exchange (x=0). Aggregate mutation rates (brown) were calculated in 
consecutive 5-kb windows. Compositionally matched null expectation was 
generated by permuting each exchange site into 100 proxy tumours and
calculating median (black) and 95% confidence intervals (grey) while
preserving the total number of projected sites per proxy tumour. d, The
elevated mutation count is not the result of a high mutation density in a subset
of exchange sites, rather it is a subtle increase in mutations across most
exchange sites. Heatmap showing mutation counts calculated in consecutive 
5-kb windows across each exchange site. Rows represent each exchange site, 
rank-ordered by total mutation count across each 400-kb interval. e, The 
distribution of positional uncertainty in exchange site location approximately 
mirrors the decay profile of elevated mutation frequency. f, Divergence of 
mutation rate spectra is shown as cosine distance between the analysed 
window and the genome wide mutation rate spectrum aggregated over all C3H 
tumours. Despite the elevated mutation frequency, there is no detected 
distortion of the mutation spectrum. g, A model based on homologous 
recombination repair intermediate, branch migration that produces 
heteroduplex segments of (i) mismatch:mismatch (circles) and (ii) lesion:lesion 
(red triangles) strands. Subsequent strand segregation would increase the 
mutational diversity of a descendant cell population but not the mutation 
count per cell (key as per Fig. 2). 

Extended Data Fig. 5 | Replication of TCR with lesion strand resolution in
Mus musculus castaneus. a, TCR of template strand lesions is dependent on
transcription level (P15 liver, median TPM). Mutation rate estimates (circles)
are the aggregate rates for expression level binned genes across CAST tumours
(n=84). Expression level bin 0 contains n=2,645 genes, all subsequent bins
contain n=4,323 genes. See Methods for per-gene, per-tumour inclusion
criteria. Empiric confidence intervals (99%) were calculated through bootstrap
sampling (n=100 replicates) of genes within the expression level bin.
b, Comparison of mutation rates for the 64 trinucleotide contexts: each
context has a high and a low expression point linked by a line. c, Sequence
composition normalized profiles of mutation rate around TSS loci.
d, Stratifying the data plotted in c by lesion strand reveals greater detail on
the observed mutation patterns, including the pronounced influence of
bidirectional transcription initiation.

Extended Data Fig. 6 | Variant allele frequency distributions demonstrate
high rates of non-mutagenic replication over segregating lesions. a-f, VAF
distributions shown as probability density functions (total area under
curve = 1) for six example tumours, calculated taking into account observed
multiallelic variation. The VAF for identified driver mutations is indicated
(brown triangle). Tumour identifiers are shown top right along with the percent
of genomic segments (based on mutation asymmetry segmentation) that are
multiallelic. Skew shows Pearson’s median skewness coefficient for the VAF
distributions. a-c, Tumours with no multiallelic segments exhibit a
symmetric VAF distribution showing minimal sub-clonal structure.
d-f, Tumours with all segments multiallelic, illustrating the sub-clonal
structure generated by segregating lesions. g, Tumours with a high proportion
of multiallelic segments have a left-skewed VAF distribution indicating
frequent non-mutagenic replication over segregating lesions. Percent of
genome segments that are multiallelic (x-axis) plotted against VAF distribution
skew for 371 C3H tumours. Tumours with low estimated cellularity indicated in
pink and excluded from correlation analysis (n=335 independent tumour
samples in Pearson’s correlation, two-sided significance from Fisher’s
transform). h, As for g, but showing 84 CAST tumours (n=77 independent
tumours included in Pearson’s correlation). i, Mutation asymmetry summary
ribbon for example C3H tumour 90797_N2; C3H genome on the x-axis. The
percent of mutation sites with robust support for multiallelic variation (y-axis)
calculated in 10-Mb windows (grey) and for each asymmetric segment (black).
Thresholds for high (black), intermediate (grey) and zero (red) rates of
multiallelic sites shown on the right axis. j, VAF density plots for the example
tumour 90797_N2 (shown in i) mutations in asymmetry segments stratified by
the multiallelic rate thresholds defined in i. As with individual tumour based
analysis (a-h), high multiallelic rates correspond to a leftward skew of the VAF
(black, grey) whereas segments without multiallelic variation (red) show a
minimally skewed distribution.

Extended Data Fig. 7 | Examples of mutation patterns generated by lesion
segregation from a diverse range of clinically relevant mutagens.
a-c, Genome-wide mutation asymmetry plots (shown as per Fig. 2a-c) for
mutagen exposed human iPSCs. Cells exposed to simulated solar radiation
illustrate lesion segregation for ultraviolet damage (a). Immediately adjacent
mutations (intermutation distance 10⁰) indicate CC>TT dinucleotide changes.
Despite a low total mutation load (1,308 nucleotide substitutions, 842
informative T>A changes), the mutational asymmetry of lesion segregation is
evident for the aristolochic acid exposed clone (b) and the polycyclic aromatic
hydrocarbon DBADE (c) that is found in tobacco smoke. d, Summary mutation
asymmetry ribbons (as per Fig. 2d) for all mutagen exposed clones with rl₂₀ > 5,
which illustrates the independence of asymmetry pattern between replicate
clones, almost universal asymmetry on chromosome X, and approximately
50% of the autosomal genome with asymmetry over autosomal chromosomes.
The dominant mutation type is indicated for each mutagen. In those clones
with low mutation rates, some sister exchange sites are likely to have been
missed leading to reduced asymmetry signal (for example, on the X
chromosome). Segments with <20 informative mutations are shown in white.

Extended Data Fig. 8 | Lesion segregation is evident for multiple DNA
damaging agents but not for damage independent mutational processes.
a, DEN-induced C3H tumour genomes (n=371) typically show significant
mutational asymmetry across their genome. Wald-Wolfowitz runs test
(x-axis) P-values calculated using a normal approximation (two-sided).
Nominal P=0.05 significance threshold indicated by dashed blue line,
Bonferroni-corrected threshold shown as solid vertical blue line. P-values
<1x10™ are rank-ordered. The rl₂₀ metric (Fig. 5a; Methods) is shown on the
y-axis, horizontal blue line gives empirical significance threshold of rl₂₀ > 5.
b, Many human iPSCs grown from single cells after exogenous mutagen
exposure show significant mutation asymmetry (n=148 WGS,
mutagen-exposed cell lines). Statistical calculations and plotting as in a, with
adjustment of Bonferroni correction. Diverse categories of mutagen, denoted
by point colour (see Fig. 5b), show asymmetry indicative of lesion segregation.
c, Cell lines with genetically perturbed genome replication and maintenance
machinery and similar mutation load to those in b do not show significant
mutation asymmetry (n=72 WGS, genetically perturbed cell lines). Statistical
calculations and plotting as in a with adjustment of Bonferroni correction.
Extended Data Table 1 | A lesion segregation-based test for oncogenic selection

Strain  Gene            Mutation          Mutation count  Odds ratio  P-value   Known driver
C3H     Braf            6:37548568_A/T    151             2.13        5.77x10°  Yes
C3H     Hras            7:145859242_T/C   81              2.67        6.88x10   Yes
C3H     Hras            7:145859242_T/A   65              1.02        1         Yes
C3H     Intronic Fmnl1  11:105081902_A/C  44              1.03        1         No
C3H     Intergenic      9:73125689_G/C    42              1.13        1         No
C3H     Egfr            11:14185624_T/A   34              3.87        1.23x10+  Yes
CAST    Braf            6:37451282_A/T    42              1.41        0.338     Yes

Recurrently mutated sites in C3H and CAST tumours with sufficient estimated power to detect oncogenic selection through biased strand retention analysis (required >33 C3H recurrences or 
>41 CAST recurrences). Odds ratio values >1 indicate the predicted correlation of driver mutation and Watson/Crick strand retention in tumours with the candidate driver mutation, but not for 
those without the mutation. The Fisher’s exact test was performed on counts of chromosomes with Watson and Crick strand asymmetries (Methods). Each tested site was autosomal, thus total 
sample sizes were: n = 2 x 371 = 742 for C3H, and n = 2 x 84 = 168 for CAST. P-values (two-sided) are shown after Bonferroni correction (7 tests performed). Known driver indicates the mutation or 
its orthologous change has previously been implicated as a driver of hepatocellular carcinoma’. The CAST 6:37451282_A/T mutation is orthologous to the C3H 6:37548568_A/T mutation. 
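The test described in the table legend (a Fisher's exact test on a 2 x 2 table, followed by Bonferroni correction over 7 tests) can be sketched as follows. This is an illustrative sketch only: the layout of the table (retained strand versus presence of the candidate driver), the function names and the example counts are assumptions, and the counts are not data from the study.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns (sample odds ratio, two-sided P value). The P value sums the
    probabilities of all tables that are no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / total

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    p_value = sum(prob(x) for x in range(lo, hi + 1)
                  if prob(x) <= p_obs * (1 + 1e-7))
    odds = (a * d) / (b * c) if b * c else float("inf")
    return odds, min(1.0, p_value)


def bonferroni(p_value, n_tests):
    """Bonferroni-corrected P value, capped at 1."""
    return min(1.0, p_value * n_tests)


if __name__ == "__main__":
    # Hypothetical counts of chromosomes, for illustration only:
    # rows = tumours with / without the candidate mutation,
    # columns = Watson / Crick strand retained.
    odds, p = fisher_exact_two_sided(30, 10, 12, 28)
    print(odds, bonferroni(p, 7))
```

Capping the corrected value at 1 is consistent with the entries of exactly 1 in the P-value column, although the study's own handling is not shown here.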
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Data collection Illumina Software Control (ICS) 3.3.76 


Data analysis Open source data analysis tools were used, all documented with version numbers in the manuscript and Supplemental Table 3. The data
analysis code specific to this project is available at https://git.ecdf.ed.ac.uk/taylor-lab/Ice-Is


Software and versions used in the analysis:
bwa mem 0.7.12
CNVkit 0.9.6
kallisto 0.43.1
strelka2 2.8.4
Bcftools 1.9
Cactus toolkit 2.1
Ensembl VEP 91.3
Juicebox 7.5
OncodriveCLUSTL 1.1.1
OncodriveFML 2.2.0
picard 1.124
Python 3.6
r-bit64 0.9_7
r-data.table 1.11.4
r-randtests 1.0.0
r-sigFit 2.0.0
Samtools 1.9
The WGS FASTQ files are available from the European Nucleotide Archive (ENA) under accession PRJEB37808. RNA-seq files are available from ArrayExpress under accession
E-MTAB-8518. Digitised histology images are available from BioStudies under accession S-BSST383.


Field-specific reporting 


Please select the one below that is the best fit for your research. If you are not sure, read the appropriate sections before making your selection. 


[x] Life sciences    [ ] Behavioural & social sciences    [ ] Ecological, evolutionary & environmental sciences


For a reference copy of the document with all sections, see nature.com/documents/nr-reporting-summary-flat.pdf 


Life sciences study design 


All studies must disclose on these points even when the disclosure is negative. 


Sample size Sample sizes for the main DEN-induced oncogenesis study were chosen to be the maximum possible with available resource, and included
whole genome sequencing of more tumours than the pilot exome-sequencing project (Connor et al., J Hepatol. 2018). This earlier exome-only
study identified mutation spectra and driver mutations in known cancer genes; this ensured that the larger whole genome study, including
replication in a second strain, was sufficiently powered to detect cancer genome properties of interest. For some statistical tests (e.g.
oncogenic selection) a simulation study was performed to calculate thresholds for 80% power prior to performing analysis, as documented in
the manuscript.


Data exclusions 116 exome sequenced human tumours from the ICGC LICA-CN project were excluded as documented in the manuscript methods section and 
Supplemental Table 2. These appear to suffer from a data processing artifact in the reported mutation calls (some types of mutation call are 
missing) and lower level data is not currently available. No exclusion criteria were pre-established. 


Replication Replication in a second mouse strain. Replication of mutation calls and multi-allelic variation in deeply sequenced exomes to confirm whole 
genome sequencing results (replication successful). Replication in human mutagen exposed cell lines (replication successful). Replication in 
publicly available human cancer data (replication successful). 


Randomization Multiple permutation based randomisations were performed, all documented in the manuscript and in the available code. 


Blinding Haematoxylin and eosin stained sections of liver tumours were blinded and assessed twice by a pathologist and discordant results reviewed 
by an independent hepatobiliary pathologist. Blinding was not relevant for genomic analyses as all processing was automated and all 
sequenced tumours were processed. 
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Materials & experimental systems Methods 
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Animals and other organisms 


Policy information about studies involving animals; ARRIVE guidelines recommended for reporting animal research 


Laboratory animals 15-day-old (P15) male C3H and CAST mice were treated with a single intraperitoneal (IP) injection of N-Nitrosodiethylamine
(DEN; Sigma-Aldrich N0258; 20 mg/kg body weight) diluted in 0.85% saline. Liver tumour samples were collected from DEN-treated
mice 25 weeks (C3H) or 38 weeks (CAST) after treatment.


Wild animals This study did not use wild animals. 
Field-collected samples This study did not use field-collected samples. 
Ethics oversight Animal experimentation was carried out in accordance with the Animals (Scientific Procedures) Act 1986 (United Kingdom) and
with the approval of the Cancer Research UK Cambridge Institute Animal Welfare and Ethical Review Body (AWERB).


Note that full information on the approval of the study protocol must also be provided in the manuscript. 
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Plant hormones coordinate responses to environmental cues with developmental 
programs’, and are fundamental for stress resilience and agronomic yield”. The core 
signalling pathways underlying the effects of phytohormones have been elucidated 
by genetic screens and hypothesis-driven approaches, and extended by interactome 
studies of select pathways*. However, fundamental questions remain about how 
information from different pathways is integrated. Genetically, most phenotypes 
seem to be regulated by several hormones, but transcriptional profiling suggests that 
hormones trigger largely exclusive transcriptional programs‘. We hypothesized that 
protein-protein interactions have an important role in phytohormone signal 
integration. Here, we experimentally generated a systems-level map of the
Arabidopsis phytohormone signalling network, consisting of more than 2,000 binary 
protein-protein interactions. In the highly interconnected network, we identify 
pathway communities and hundreds of previously unknown pathway contacts that 
represent potential points of crosstalk. Functional validation of candidates in seven 
hormone pathways reveals new functions for 74% of tested proteins in 84% of 
candidate interactions, and indicates that a large majority of signalling proteins 
function pleiotropically in several pathways. Moreover, we identify several hundred 
largely small-molecule-dependent interactions of hormone receptors. Comparison 
with previous reports suggests that noncanonical and nontranscription-mediated 
receptor signalling is more common than hitherto appreciated. 


To examine phytohormone signal integration by the plant protein
network, we first identified 1,252 genes with probable or genetically
demonstrated functions in phytohormone signalling (Fig. 1a and
Supplementary Table 1). The corresponding network of literature
curated binary interactions (LCIs) from the IntAct database? (LCI_IntA)
shows extensive intrapathway but sparse interpathway connectivity
(Extended Data Fig. 1), which could reflect an insulated organization
of hormone signalling, or could be an artefact of inspection biases®.
We therefore experimentally generated a systematic (unbiased design)
map of the phytohormone signalling network. After cloning open
reading frames (ORFs) for 1,226 (98%) of the selected genes
(PhyHormORFeome), fivefold investigation of the pairwise matrix using
a high-quality yeast two-hybrid (Y2H)-based mapping pipeline’ yielded
the phytohormone interactome main (PhI_MAIN) network. To find links
into the broader Arabidopsis network, we screened PhyHormORFeome
against roughly 13,000 Arabidopsis ORFs*®, resulting in an asymmetric
PhI_EXT data set. We also conducted focused screens for interactions
of pathway-specific repressors with transcription factors? (PhI_Rep-TF),
and for hormone-dependent interaction partners of phytohormone
receptor proteins (PhI_HORM). In the stringent final step of the common
Y2H pipeline, all candidate pairs were fourfold-verified (Fig. 1b).

The combined PhI network contains 2,072 interactions, of which
1,572 were previously unknown (Fig. 1c, Extended Data Fig. 1 and
Supplementary Table 2). The interaction density in the symmetrically
investigated PhI_MAIN (0.4‰) is higher than in the proteome-scale
Arabidopsis Interactome-1 (AI-1, 0.1‰)"°, but lower than in the abscisic-acid
(ABA)-focused interactome (7.5‰)’. It is likely that the increasing focus
on functionally coherent proteins underlies this trend, but system
Fig. 1 | Phytohormone network mapping and analysis. a, Mutant phenotypes
and overrepresented families in PhyHormORFeome candidates. ABA, abscisic
acid; AUX, auxins; BR, brassinosteroids; CK, cytokinins; ET, ethylene; GA,
gibberellic acids; IAA, indole-3-acetic acid; JA, jasmonic acid; KAR, karrikins;
SA, salicylic acid; TF, transcription factor. b, Pipeline for mapping protein
interactions. c, Phytohormone interactome (PhI) network. For node colours,
see a. d, PhI_MAIN sampling sensitivity: black dots show the verified interactions
of the first three primary-screen repeats (n = 3); the black line shows the screen
saturation model based on three repeats (with the grey corridor representing
the standard error); blue dots show identified interactions after five repeats.
e, Y2H assay sensitivity: positive fractions of PRS_PhI (n = 92) and RRS_PhI (n = 95).
Error bars represent the standard error. f, Validation results: positive fractions
of PRS_PhI (n = 69), RRS_PhI (n = 83) and PhI (n = 285). One-sided Fisher’s exact test;
error bars represent the standard error of proportion. g, Overlap of PhI_MAIN with


differences" and screening parameters” also affect overall sensitivity.
We implemented an interactome mapping framework®” to compare
PhI to literature-based network maps from IntAct and BioGRID”
(LCI_BioG). Sampling sensitivity of PhI_MAIN after five repeat screens was
86 ± 5% (Fig. 1d). For benchmarking, we recurated” a positive and a
random reference set (PRS_PhI and RRS_PhI) of 92 and 95 protein pairs
(Supplementary Table 2), respectively. Benchmarking our Y2H system
yielded an unconditional assay sensitivity of 20.4% (Fig. 1e); excluding
hormone-dependent PRS_PhI interactions increased this to 23%.
LCI_IntA (n = 109) and LCI_BioG (n = 150 interactions). Error bars show the
propagated standard error. h, Distribution of the PhI_MAIN degree and clustering
coefficient. i, Number of hormone-signalling-function-enriched communities
in PhI_MAIN (red arrow) compared with n = 1,000 randomized control networks.
j, JA- and CK-enriched community links (for node colours, see a). k, l, Distances
between pathway combinations in PhI_MAIN (k) and LCI,,., (l). Colours show
average shortest distances; circle sizes show connection counts; insets show
shortest distance distributions. m, Count of type I PCPs (n = 192) and type II
PCPs (n = 248) in PhI_MAIN; P value is from the analysis in n (*** P < 0.001).
n, Proportion of PCP_I in PhI_MAIN and LCI networks from bootstrap subsampling
(n = 1,000) of 100 interactions (two-sided Welch two sample t-test). Boxes show
the interquartile range (IQR) and median; whiskers show highest and lowest
data points within 1.5 IQR; outliers are plotted individually.


The resulting overall completion of 16.0 ± 6.8% matches the overlap with
LCI data sets (Fig. 1g). Thus, missed interactions explain the incomplete
overlap between PhI_MAIN and LCI,,,, suggesting a low false-discovery
rate. This is substantiated by the observation that no RRS_PhI pair scored
positive (Fig. 1e).

To further assess PhI quality, we used a pulldown assay in which
protein pairs are expressed in wheat germ lysate and, following an anti-Flag
immunoprecipitation, interactions are detected through the activity
of a second protein fused to Renilla luciferase. Benchmarking this
Fig. 2 | Validation of pathway contact points. a, Proportion of germinating
seeds in the absence (Murashige and Skoog (MS) standard medium) or
presence of 0.3 µM ABA (n = 20; three repeats). b, Root elongation in the
absence or presence of 30 µM ABA. Boxes represent the IQR; bold black line
represents the median; whiskers indicate highest and lowest data points within
1.5 IQR; outliers are plotted individually (n = 8; two repeats). Col-0, wild-type
background. c, d, CK-dependent anthocyanin accumulation in the indicated
wild-type or mutant backgrounds in response to the indicated concentrations
of 6-benzylamino purine (BA). c, Seedlings at 10 days after stratification
following the indicated treatment. Scale bars, 2 mm. d, Quantified anthocyanin
content per gram of fresh weight for lines in c (n = 15; four repeats). A530,
absorbance at 530 nm. e, ET-induced apical loop formation in response to


assay with PRS_PhI/RRS_PhI revealed an assay performance similar to that
of previous implementations"; the slightly increased background
probably results from the relatively functionally coherent search space
from which RRS was sampled. Subsequent testing of 285 interactions
from the unconditional PhI_MAIN, PhI_EXT and PhI_Rep-TF subsets yielded a
PhI validation rate of 22.5%, which is indistinguishable from that of
PRS_PhI (23.5%, Fig. 1f) and similar to that for the individual subsets
(Extended Data Fig. 1). These data show that PhI is a high-quality map
of the Arabidopsis phytohormone signalling network and is on a par
with high-quality literature data.
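As a rough check on why validation rates of 22.5% and 23.5% count as indistinguishable, the standard error of each proportion can be computed from the reported rates and the sample sizes given in the Fig. 1f legend (285 PhI pairs and 69 PRS pairs). The sketch below assumes the usual binomial formula for the standard error of a proportion; the function name is hypothetical and the calculation is illustrative rather than a reproduction of the authors' analysis.

```python
import math


def proportion_se(p, n):
    """Binomial standard error of a proportion p estimated from n trials."""
    return math.sqrt(p * (1.0 - p) / n)


if __name__ == "__main__":
    se_phi = proportion_se(0.225, 285)  # PhI validation rate
    se_prs = proportion_se(0.235, 69)   # positive reference set
    # The difference between the two rates is small relative to either error.
    print(round(se_phi, 3), round(se_prs, 3), round(0.235 - 0.225, 3))
```

Under this formula the one-percentage-point difference between the two rates is well within one standard error of the smaller reference set.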

For hypothesis generation and local network analyses, the full PhI
will be most useful. For topological and systems-level questions, the
symmetrically mapped PhI_MAIN should be used to avoid biases®. PhI_MAIN
has a scale-free degree distribution and, in contrast to LCI), networks,
a hierarchical modularity, as expected for unbiased network maps“


10 µM ACC (n = 10; three repeats). f, SA-associated phenotypes in response to
inoculation with Pst, showing in planta Pst titres (n = 9); cfu, colony-forming
units. g, Summary of hormone validation assays for 19 PCPs. Node colours
indicate known pathway annotations (Fig. 1a); square colours indicate new
phenotypes. h, BiFC assay of the PCP, candidate protein pairs DDL-TTL and
DDL-MYC2 and matched negative controls (to the right) (each assay performed
in duplicate). Scale bars, 10 µm. i, Literature-reported specificity (single
pathway annotation) and pleiotropy (multiple pathway annotations) of genes
encoding 1,252 total target proteins and 27 proteins in the validation set, as well
as updated specificity and pleiotropy following our hormone validation assay.
a, b, d-f, Two-sided t-test; * P < 0.05; ** P < 0.01; *** P < 0.001. a-f, Precise
P values, biological repeats and n values are shown in Supplementary Table 5.


(Fig. 1h and Extended Data Fig. 1). We used PhI_MAIN to investigate the
topological organization of phytohormone signalling pathways.
Important features of hierarchical networks are highly connected
hubs and interconnected communities”. Using a detection algorithm
based on edge betweenness”, we identified 21 network communities
in PhI_MAIN, of which 9 were substantially enriched in different
phytohormone pathways (Fig. 1i, Extended Data Fig. 2 and Supplementary
Tables 3, 4). Thus, the topology of PhI_MAIN recapitulates biological
knowledge and confirms that at least some pathway proteins are
highly interconnected. In addition, most communities encompass
proteins from different pathways that possibly mediate crosstalk.
In the jasmonate-signalling community, for example, the canonical
jasmonate-pathway transcription factor MYC2 is physically linked
to ABA signalling via interaction with the protein kinase CIPK14
(Fig. 1j), validated by in vitro pulldown and bimolecular fluorescence

Nature | Vol583 | 9 July 2020 | 273 


Article 


[Fig. 3 graphic: interaction networks and yeast assay panels a-i. Edge key: hormone independent; hormone dependent; blocked by hormone.]


Fig. 3 | Hormone-receptor interactions. a, ABA-dependent Y2H interactions.
All identified interactors were systematically tested against all receptors in the
presence and absence of ABA. b, c, Y3H assays for the indicated protein triplets.
In all sets, we tested DB-RCAR1 for interactions with AD-MYB proteins in the
presence of the indicated PP2Cs and in the presence or absence of ABA. b, One
of four representative Y3H results. The asterisk indicates an ABA-dependent
interaction. c, Y3H subnetwork showing the data in b. d, SA-dependent
interactors of NPR1, NPR3 and NPR4. e, Each image shows one representative yeast colony
(from four repeats) in the presence and absence of 100 µM SA for identified
NPR interactors. f, Hormone-dependent and -independent interactions of
KAI2, D14 and MAX2. g, Each image shows one representative of four yeast


complementation (BiFC) (Extended Data Fig. 2). Additional pathway 
contacts occur between different communities (Fig. 1j). However, on 
average only 27% of pathway proteins reside within the correspond- 
ing communities, indicating that phytohormone signalling may not 
be predominantly organized into topological communities (Supple- 
mentary Table 3). 
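Community detection based on edge betweenness can be illustrated with a small pure-Python sketch in the style of the Girvan-Newman procedure: compute the betweenness of every edge, remove the highest-scoring edge, and repeat until the network falls apart into more components. The sketch assumes an unweighted, undirected edge list; the function names and the toy graph are hypothetical, and the published analysis may have used a different implementation and stopping rule.

```python
from collections import deque


def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def edge_betweenness(adj):
    """Edge betweenness (Brandes accumulation) for an unweighted graph."""
    score = {}
    for s in adj:
        dist, sigma, preds, order = {s: 0}, {s: 1.0}, {s: []}, []
        queue = deque([s])
        while queue:  # breadth-first search counting shortest paths
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w], sigma[w], preds[w] = dist[v] + 1, 0.0, []
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in order}
        for w in reversed(order):  # push path dependencies back to the source
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                edge = frozenset((v, w))
                score[edge] = score.get(edge, 0.0) + share
                delta[v] += share
    # Every unordered node pair was visited from both of its ends.
    return {edge: value / 2.0 for edge, value in score.items()}


def components(adj):
    """Connected components as a list of node sets."""
    seen, found = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        found.append(comp)
    return found


def split_once(edges):
    """Remove highest-betweenness edges until the component count rises."""
    adj = build_adjacency(edges)
    start = len(components(adj))
    while len(components(adj)) == start:
        scores = edge_betweenness(adj)
        if not scores:
            break  # no edges left to remove
        u, v = max(scores, key=scores.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)


if __name__ == "__main__":
    # Toy graph: two triangles joined by a single bridge edge c-d.
    toy = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"),
           ("d", "e"), ("e", "f"), ("d", "f")]
    print(split_once(toy))
```

Calling `split_once` repeatedly on the edges that remain inside each component would give a hierarchy of communities; where to stop splitting is a separate choice that this sketch does not make.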

We next analysed interpathway connectivity. The distances between
the phytohormone pathways are considerably shorter in PhI_MAIN than
in LCI,,, (Fig. 1k, l). This is mirrored by the many more pathway contact
points (PCPs; protein-interaction-mediated contacts between different
pathways) in PhI_MAIN than LCI,,,. As some proteins operate in several
pathways, we distinguished 192 type I PCPs (PCP_I), involving proteins
with strictly different annotations, from 248 type II PCPs (PCP_II), in
which the interactors share annotations, but at least one has additional
functions (Fig. 1m). Bootstrap subsampling confirmed that PhI_MAIN
contains substantially more type I PCPs (Fig. 1n), but not type II PCPs (data
not shown), than do LCI_IntA or LCI_BioG; this is valid for nearly all pathway
pairs (Extended Data Fig. 3). Each discovered PCP supports a specific
crosstalk hypothesis, and the abundance of PCPs suggests extensive
protein-interaction-mediated information exchange among pathways.
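The two kinds of pathway contact point, and the subsampling comparison, can be made concrete with a short sketch. It is illustrative only and rests on assumptions: each protein is taken to carry a set of pathway annotations, pairs with identical annotation sets or with an unannotated partner are treated as non-PCPs, and subsampling is done without replacement. The function names and the toy data are hypothetical.

```python
import random


def classify_contact(annot_a, annot_b):
    """Classify an interaction by the pathway annotations of its two partners.

    'PCP_I'  : annotation sets are disjoint (strictly different pathways).
    'PCP_II' : sets overlap, but at least one partner has an extra annotation.
    None     : identical annotation sets, or a partner without annotation.
    """
    a, b = set(annot_a), set(annot_b)
    if not a or not b:
        return None
    if a.isdisjoint(b):
        return "PCP_I"
    if a != b:
        return "PCP_II"
    return None


def subsample_pcp1_fraction(interactions, annotations, size=100,
                            rounds=1000, seed=0):
    """Fraction of type I PCPs in repeated random subsamples of a network."""
    rng = random.Random(seed)
    fractions = []
    for _ in range(rounds):
        sample = rng.sample(interactions, size)  # without replacement
        hits = sum(
            classify_contact(annotations.get(u, ()), annotations.get(v, ()))
            == "PCP_I"
            for u, v in sample
        )
        fractions.append(hits / size)
    return fractions


if __name__ == "__main__":
    # Toy annotations, not taken from the data set.
    annotations = {"P1": {"ABA"}, "P2": {"JA"}, "P3": {"ABA", "JA"}}
    print(classify_contact(annotations["P1"], annotations["P2"]))  # PCP_I
    print(classify_contact(annotations["P1"], annotations["P3"]))  # PCP_II
```

Running `subsample_pcp1_fraction` on two networks yields two distributions of fractions that could then be compared, for instance with a Welch t-test as named in the Fig. 1n legend.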


Validation of pathway contact points 

We experimentally tested whether PCPs reflect as yet unknown func- 
tions of the interacting partners. Assays for most hormones have been 
well established in seedlings. Therefore, and for standardization, we 
focused on seedling-expressed PCP interaction pairs. Homozygous 
T-DNA lines were obtained and validated for 27 candidate proteins 
involved in 19 PCP pairs. These were used in response assays for 
seven different phytohormones to establish whether the candidates 
function in the pathway of their respective PCP partner (Fig. 2a-f, 
Extended Data Figs. 4-7 and Supplementary Table 5). 
spots for selected KAI2 interactors in presence and absence of rac-GR24.
h, Representative images from the analysis in i, showing root hair phenotypes
from the indicated genotypes. Scale bars, 1 mm. i, Quantification of root hair
density and root hair length after the indicated treatments. Different letters
indicate statistically significantly different groups within each graph (one-way
analysis of variance (ANOVA); post hoc Tukey, P < 0.05). Boxes represent IQRs;
bold lines show medians; whiskers indicate the highest and lowest data points
within 1.5 IQR; outliers are plotted individually. Precise n and P values for all
group comparisons are in Supplementary Table 5. a, c, d, g, Modulated
interactions are represented according to the key in c. Node colours represent
hormone annotations as in g.


ABA regulates seed germination and desiccation stress responses,
including root growth"®. We found that in the presence of 0.3 µM ABA,
the germination of wild-type seeds was decreased by roughly 40%.
By contrast, the candidate lines ddl, ;, -;and eds1,;, s, displayed a similar 
ABA hypersensitivity to control seedlings, in which expression of the
RCAR1 ABA receptor is disrupted (Fig. 2a and Extended Data Fig. 4).
Root growth was significantly altered by 30 µM ABA in 4 additional
lines compared with wild type (@s1, 5, ¢4, D€2,it pr: P< 0.01} Bail, ca, 
wrky54,;. 54: P< 0.001) (Fig. 2b and Extended Data Fig. 4). Thus, six of 
the nine candidate lines (66%) exhibited at least one ABA phenotype. 

Anthocyanin production is a widely used assay for cytokinin 
signalling”. We found that, at low concentrations, cytokinin-induced 
anthocyanin accumulation was impaired in the candidate lines to a 
similar degree as in the spy control. At higher cytokinin concentra- 
tions, anthocyanin accumulation in the myc2,i, ja/aa line remained 
similar to that in spy seedlings, whereas the jaz) jr ja/aga line overaccu- 
mulated anthocyanin, indicating complexity in cytokinin signalling 
(Fig. 2c, d). 

For the ethylene-based assay, we assayed the ‘triple response’—that 
is, the formation of exaggerated apical hooks (loops) and the devel- 
opment of shorter and thicker roots and hypocotyls in dark-grown 
seedlings’. Ten of our twelve candidates (83%) displayed an apical loop 
phenotype; seven of these also displayed a root-growth phenotype, 
and the ¢th, ;. gp line also had a hypocotyl growth defect following treat- 
ment with the ethylene precursor aminocyclopropane carboxylate 
(ACC) (Fig. 2e and Extended Data Fig. 5). To ensure specificity, we tested 
seven mutant lines for proteins in PhI that showed no interaction with 
ethylene-annotated proteins. Of these controls, only one showed a 
weak root-growth phenotype and none exhibited a hypocotyl or loop 
formation defect (Fig. 2e and Extended Data Fig. 6). 

Salicylic acid mediates defence responses to (hemi-)biotrophic path- 
ogens”. Following inoculation with Pseudomonas syringae pv. tomato 


(Pst), titres in the gi, ;, c, mutant were significantly elevated (Fig. 2f), 
indicating enhanced disease susceptibility and impaired salicylic acid 
signalling. Similarly, leaves of mature rcar,j, 4g, and PP2CQ, i, apa plants
supported enhanced Pst growth (Fig. 2f). Assays for root-growth inhi- 
bition by brassinosteroids, gibberellins and jasmonates revealed new 
phenotypes for two candidates (for brassinosteroids and gibberellins) 
or one candidate (for jasmonates) (Extended Data Fig. 4). 

Altogether, interactome-guided phenotyping revealed a function in
new pathways for 74% of tested proteins (20/27), which are involved 
in 84% of interactions in the validation set (Fig. 2g and Extended Data 
Fig. 7). Notably, for all PCP, pairs a novel function was revealed for at 
least one partner, such that all interactions are substantiated by phe- 
notypes in at least one common pathway (Fig. 2g). For three of the 
six PCP, pairs, an additional common pathway was identified, such 
that more than half (11/19) of all PCP pairs operate genetically in two 
common pathways (Fig. 2g). To support these functional data, we 
used BiFC to demonstrate in planta interactions for nine PCP pairs 
(Fig. 2h and Extended Data Fig. 7). Before these experiments, a large 
majority of signalling proteins in the literature and in our validation 
set were considered to be pathway-specific (Fig. 2i). Following our 
interactome-guided phenotyping, however, 82% of proteins in the 
validation set are known to function in multiple pathways, and only 
one fifth are specific to a single pathway (Fig. 2g, i). The new annotations 
are distributed across different pathways (Extended Data Fig. 7) and 
the network degree does not correlate with the number of phenotypes 
(data not shown). As the validation set is not obviously biased, our 
observation of widespread pleiotropy may extrapolate to most of the 
phytohormone signalling network. Thus, our data point to a highly 
integrated central signal-processing network that channels differ- 
ent inputs into a balanced multifactorial output. To facilitate further 
studies, we provide an expression-based ‘edge score’ that indicates the 
possibility of each PhI interaction occurring in different plant tissues
(Supplementary Table 6). 


Hormone-receptor interactions 


Input into the central processing unit is provided by hormone receptors,
which often initiate signalling through small-molecule-regulated
protein interactions”’. To better understand initial phytohormone
signalling, we conducted interaction screens using soluble hormone
receptors in the presence and absence of their cognate hormone. For
the ABA, gibberellic acid, indole-3-acetic acid (IAA), karrikin (KAR),
salicylic acid and strigolactone receptors, we identified 241 interactions
(PhI_HORM), of which 101 are hormone-dependent. Re-identified
pairs include interactions of gibberellic acid receptors with DELLA
proteins, and of the ABA receptors RCAR/PYR/PYL with type 2C protein
phosphatases (PP2Cs) (Fig. 3a and Extended Data Fig. 8), which
display known patterns of hormone dependence”. Notably, several
ABA receptors also interacted with transcription factors and other
non-PP2C proteins (Fig. 3a). As some of these also link to PP2Cs, we
wondered whether interactions are combinatorially modulated, and
investigated by yeast three-hybrid the effect of different PP2Cs on
RCAR1/PYL9 interactions with MYB-family transcription factors. The
RCAR1-MYB73 interaction was blocked by several PP2Cs, whereas
the RCAR1-MYB77 interaction was enabled by ABI1/2, together
demonstrating dynamic modulation of complex formation (Fig. 3b, c).
In addition, PP2C-independent RCAR functions have been described
for RCAR9/PYL6 via MYC2 (ref. ”) and for RCAR3/PYL8 via MYB77
(ref. **). Our data suggest that such core-pathway-independent functions
may be more widespread. The independently validated interaction
of DELAY-OF-GERMINATION 1 (DOG1) with PP2Cs* similarly
points to noncanonical PP2C-mediated signalling mechanisms. Thus,
core-pathway-independent signalling and complex multimeric interaction
regulation are important mechanisms underlying the functional
diversification in the ABA signalling system.


Receptors for the defence hormone salicylic acid are the 
NON-EXPRESSOR OF PATHOGEN RELATED PROTEIN 1 (NPR1) and 
its orthologues NPR3 and NPR4 (ref. *). While NPR1 is a well-studied 
positive regulator of defence-gene transcription, NPR3 and NPR4 are 
emerging as alternative negative or complementary transcriptional 
regulators**”°. The pattern of salicylic-acid-regulated NPR3 interactions 
(Fig. 3d and Extended Data Fig. 9), especially with NIMIN proteins, differs
from the described NPR1 pattern”, suggesting dynamic complexity
of this signalling system. EMB1968/RFC4, a member of the replication
factor C (RFC) complex, is a new interactor that is common to NPR1 and
NPR3, and may integrate defence with DNA repair or replication. Most
previously unknown NPR3/NPR4 interactors can be linked to immunity 
via mutant phenotypes or known interactions with virulence effec- 
tors and immune receptors? (Fig. 3d and Extended Data Fig. 9). These 
data support the biological validity of the interactions and indicate 
that salicylic acid receptors also act through nontranscriptional 
signalling. 

The KAR and strigolactone pathways have been discovered most 
recently and mediate germination (in the case of KAR) and diverse 
aspects of development and organismal interactions”’. We screened the 
KAR receptor KAI2 and the strigolactone receptor D14, together with 
the F-box protein MAX2, in the absence and presence of a stereoiso- 
meric mix of two synthetic strigolactones, which bind to D14 and KAI2, 
respectively”’. For KAI2 we found the previously described interaction
with MAX2 and 21 new interactors, of which 15 are hormone-dependent
(Fig. 3f, g and Extended Data Fig. 9). It was previously found that KAI2
regulates root hair length (RHL) and root hair density (RHD)*°. As both
phenotypes are also regulated by auxin, and as the hormone-dependent
KAI2 interactor PP2AA2 regulates the PIN auxin exporters, we wondered
whether PP2AA2 mediates the effect of KARs on these phenotypes.
Similar to kai2-2 mutants, pp2aa2-2 plants displayed a lower RHL and
RHD than wild-type Col-0 plants (Fig. 3h, i and Supplementary Table 5).
Notably, in both kai2-2 and pp2aa2-2 mutants the response to exogenous
KAR treatment was abolished, indicating that these proteins
jointly mediate signalling by the KAR pathway.

Transcriptional changes are common outcomes of phytohormone 
signalling. Investigating PhI_Rep-TF, we found no evidence of substantial
hormone crosstalk at the level of transcriptional regulators from dif- 
ferent pathways converging on transcription factors (data not shown). 
Nonetheless, only a quarter of the transcription factors that we found 
here to interact with the selected regulators were previously implicated 
in hormone signalling (Extended Data Fig. 10). Although most path- 
ways converge on TCP-family transcription factors, which are known 
for their high connectivity’®, most transcription factors interact with 
repressors from one to three pathways, suggesting more specific signal 
integration at this level. 

We have presented here a systematic map of the Arabidopsis phyto- 
hormone signalling network, which reveals an unexpectedly high inter- 
connectivity of signalling pathways. If the observed level of functional 
pleiotropy extends into the larger hormone signalling network, the con- 
cept of dedicated signal-transduction pathways may need to be revised 
in favour of network-based models. The small-molecule-dependent 
interactions of hormone receptors point towards prominent roles 
for noncanonical signalling mechanisms. We anticipate that our find- 
ings and the PhI resource will stimulate mechanistic and systems-level 
analyses of Arabidopsis and crop plants. 
Methods


No statistical methods were used to predetermine sample size. Experi- 
ments were not randomized unless otherwise indicated. The inves- 
tigators were not blinded to allocation during plant phenotyping 
experiments and outcome assessment. 


PhyHormORFeome selection and cloning 

We selected target genes as follows: first, those with a known mutant 
phytohormone signalling phenotype on the basis of annotations in 
the Arabidopsis Hormone Database (AHD2.0)*; second, all members 
of gene families that were overrepresented in step 1; and third, those 
highlighted by input from colleagues. In total we selected 1,252 genes, 
for which 1,226 full-length ORFs could be obtained. To physically assemble
the PhyHormORFeome, we picked 688 ORFs from our published
AtORFeome collection’, 276 from the Arabidopsis Biological Resource
Center (ABRC), 11 from colleagues and 277 amplified from a Col-0 complementary
DNA mix from different tissues. For RNA extraction, we
grew 6-10-day-old A. thaliana Col-0 seedlings, separating organs and 
tissues from mature plants (from flowers and siliques (all developmen- 
tal stages), nodes, internodes, rosette leaves and cauline leaves; roots 
were from 15-day-old plants; all plants were grown on vertically standing 
solid MS agar plates imbibed seeds). From all plant organs, tissue types 
and seedlings, we extracted specific total RNA using a NucleoSpin RNA 
kit from Macherey-Nagel, following the manufacturer’s recommendations.
For cDNA synthesis, we modified the Superscript III (Thermo
Fisher 18080044) protocol using 25 ng of random primers and 250 ng
oligo d(T)16 per 1 µg total RNA. The mixture was heated to 70 °C for
5 min and incubated at 21 °C for 10 min. A mixture of 2.5 µl (0.1 µM)
dithiothreitol (DTT), 10 U RNase OUT (40 U µl⁻¹), 250 U SSIII (200 U µl⁻¹),
4 µl SSIII ×5 buffer and 2.5 µl (2 µM) dNTPs was added and incubated at
21 °C for 10 min and then at 42 °C for 120 min. To generate cDNA longer
than 5 kilobases, we added a further 250 U of SSIII (200 U µl⁻¹) to the
mixture, followed by incubation at 55 °C for 30 min for elongation and 
70 °C for 15 min for inactivation. All generated cDNAs from different 
organs, tissues and seedlings were mixed in equal amounts and 2 µl of
the undiluted cDNA mixture (roughly 100 ng) was used to amplify the 
ORFs of interest. ORF amplification was conducted as a nested polymerase
chain reaction (PCR) to attach attB cloning sites for further Gateway
cloning. The primers comprise 18 base pairs specific to attB and 12 bp
specific to a partial attB site (forward attB overhang, GCAGGCTCAGGA;
reverse attB overhang, GAAAGCTGGGTC). All ORFs were generated with
a stop codon. In the second PCR, full attB sites were added to the ORFs
(forward attB, GGGACAAGTTTGTACAAAAAAGCAGGCTCAGGAATG;
reverse attB, GGGGACCACTTTGTACAAGAAAGCTGGGTC). Gateway
cloning and yeast transformation were carried out as described’. ORFs 
cloned herein are available from stock centres. 


Y2H-based pipeline for interaction mapping 

Network mapping was performed as described’. In brief, bait ORFs were 
expressed as genetic fusions to the GAL4 DNA-binding domain (pDEST- 
DB); prey ORFs were expressed as genetic fusions to the minimal GAL4 
activation domain (pDEST-AD). Both constructs were maintained as 
low-copy centromeric (cen) plasmids and expressed from weak adh2 
promoters. Primary screening was carried out by mating individual 
DB-plasmid-containing haploid yeast strains (Y8930, MATα) with a
mini-pool of haploid Y8800 (MATa) AD-plasmid-containing strains.
Following a three-day selection on selective plates containing 1 mM
3-amino-1,2,4-triazole to repress background HIS3 activity, positive single
colonies were picked and retested on selective media and cycloheximide
control plates. Colonies showing specific selective growth
were lysed; the respective ORFs were amplified with generic primers 
that include position-specific barcodes and subsequently identified 
using the kiloSeq service by seqWell. All primary Y2H screens were performed
once, except for the PhI_MAIN screen, which was carried out five
times. The receptor screens and the PhI_RepTF screen were verified systematically:
that is, in the final verification all identified interaction candidates
were tested against all receptors or repressors/regulators, respectively. 
The receptor screens were performed in the absence and presence of 
the respective phytohormones applied to the selective media. For the 
ABA-receptor screen, we used 30 µM abscisic acid; for the IAA-receptor
screen we used 100 µM IAA; for the gibberellic-acid-receptor screen we
used 100 µM GA3; and for the salicylic-acid-receptor screen we used
100 µM salicylic acid. Signalling pathways involving the strigolactone
receptor D14 and the karrikin receptor KAI2 were screened with 5 µM
rac-GR24. 


Y3H assay 

We genetically fused RCAR1 to the GAL4 DNA-binding domain using
pDEST-DB, and the MYB proteins to the minimal GAL4 activation 
domain using pDEST-AD. To test for modulation of these interactions, 
we expressed the indicated PP2Cs from the helper plasmid pVTU-DEST, 
maintained via the URA3 selection marker. All combinations of RCAR1 
and PP2Cs were transformed into the haploid yeast strain Y8930 and 
mated against Y8800 transformed with the AD-MYB constructs. Y3H
assays were performed independently four times in the presence or
absence of 30 µM ABA on selective plates (synthetic complete media
(Sc) lacking tryptophan (W), leucine (L), uracil (U) and histidine (H) 
(Sc-W-L-U-H)) containing 1 mM 3-amino-1,2,4-triazole to repress back- 
ground HIS3 reporter activity. Interactions that were verified in three 
repeats were counted as Y3H interactions. 


Protein-protein interaction reference set 

Candidate interactions for the positive reference set (PRS) were com- 
piled from protein-protein interactions from IntAct (downloaded 
August 2014)° and BioGRID (version 3.2.115)*. At that time, the IntAct 
data set contained 17,574 interactions and the BioGRID data set con- 
tained 21,474 interactions among A. thaliana molecules. From both 
data sets, we removed protein-DNA interactions, interactions derived 
from papers that reported more than 100 interactions, and non-binary 
interactions in protein complexes. Subsequently, we filtered both data 
sets for interactions described in at least two publications or identified 
in at least two binary interaction detection methods. This resulted in 
233 interactions, from which we randomly picked 140 interactions
described in 247 publications for recuration. This yielded a selection 
of 92 highly reliable binary protein-protein interactions, 
which constitute the PRS_PhI. Ten of these 92 interactions were
phytohormone-dependent interactions. To assemble the random
reference set (RRS_PhI), we randomly sampled 95 protein pairs from
proteins in our PhyHormORFeome, excluding already described 
protein-protein interaction pairs. 


Parameters of the interaction mapping framework 

To assess the quality of the PhI map (that is, false positive and false
negative interactions), we implemented the interactome mapping
framework as described*® and estimated the assay sensitivity, sampling
sensitivity, precision and completeness. 

The completeness of the PhI_MAIN screening space (that is, the proportion
of tested protein pairs in comparison to the theoretical number
in the full search space) was based on the number of available ORFs
in PhyHormORFeome. The initially defined search space comprised 
1,252 loci and thus 1,567,504 possible protein pairs. For the screen 
of PhI_MAIN, we tested 1,254 ORFs corresponding to 1,199 gene loci, of
which 1,179 were present as AD- and DB-hybrid constructs, 15 as 
AD-hybrid constructs only, and 5 as DB-hybrid constructs only. 
Together, AD- and DB-hybrid constructs for 90.2% of locus combinations
were tested for interactions; this percentage corresponds to the
completeness.
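
The 90.2% figure can be checked from the counts quoted above. The following is a minimal sketch, assuming that the search space is the square of the 1,252 selected loci and that every available DB-hybrid locus was tested against every available AD-hybrid locus; the function name is illustrative and does not come from the authors' scripts.

```python
def completeness(total_loci, both, ad_only, db_only):
    """Fraction of the locus-by-locus search space covered by tested DB x AD pairs."""
    search_space = total_loci ** 2        # all ordered locus combinations
    db_loci = both + db_only              # loci available as DB-hybrid constructs
    ad_loci = both + ad_only              # loci available as AD-hybrid constructs
    return (db_loci * ad_loci) / search_space


# Counts quoted in the text: 1,252 loci; 1,179 as both hybrids; 15 AD only; 5 DB only.
value = completeness(total_loci=1252, both=1179, ad_only=15, db_only=5)
print(f"search space: {1252 ** 2:,} pairs")
print(f"completeness: {value:.1%}")
```

Under these assumptions the tested space is 1,184 x 1,194 locus combinations, which reproduces the stated completeness.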

We estimated the sensitivity of our Y2H assay for detecting 
phytohormone-signalling-related proteins by benchmarking the
system using PRS_PhI/RRS_PhI. Of the 92 tested possible PRS_PhI pairs, 19
pairs were detected, whereas no RRS_PhI pair scored positive, thus yielding
an assay sensitivity of 20.7 ± 4.2%. Exclusion of the nine PRS_PhI pairs
that require presence of a phytohormone, none of which was detected
by the unconditional Y2H assay, resulted in an unconditional assay
sensitivity of 22.8% ± 4.6%.
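
As a worked check of the assay sensitivity, the sketch below computes the detected fraction and a binomial standard error. That the reported uncertainty is a binomial standard error is an assumption; the text does not state how it was derived.

```python
import math


def sensitivity_with_se(detected, tested):
    """Return the detected fraction and its binomial standard error."""
    p = detected / tested
    se = math.sqrt(p * (1 - p) / tested)
    return p, se


# 19 of the 92 positive-reference pairs were detected in the Y2H assay.
p, se = sensitivity_with_se(detected=19, tested=92)
print(f"assay sensitivity: {p:.1%} ± {se:.1%}")
```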

Sampling sensitivity (that is, saturation of the screen with regard to
detectable interactions measured by the assay sensitivity) was estimated
as described”. In brief, using a modified Michaelis-Menten function
in the R package drc (3.0-1), we modelled the increase of identified
interactions per additional repeat screen. Using the first three repeats
of the PhI_MAIN screen to develop the saturation model, we estimated
saturation to occur at 616 ± 38 interactions. We then challenged the
model with two further repeats of the primary screen. This resulted
in a data set of 529 interactions, which matches the model prediction
of 519 ± 31 interactions after five repeats.

Overall sensitivity is the product of assay sensitivity and sampling
sensitivity. With an assay sensitivity of 20.7 ± 4.2% and sampling
sensitivity of 85.9 ± 5.3%, the overall sensitivity is 17.8% ± 6.8% including
conditional interactions in PRS_PhI. The unconditional overall sensitivity
of 19.4 ± 7.0% is the product of the unconditional assay sensitivity of
22.8 ± 4.6% and the sampling sensitivity of 85.9 ± 5.3%. We estimated
the overall completion of the screen as the product of the overall
sensitivity and completeness; thus, overall completion of PhI_MAIN is
16.0% ± 6.8%.
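
The chain of products described above can be traced with the point estimates quoted in the text. This sketch assumes that sampling sensitivity is the ratio of observed interactions (529) to the estimated saturation level (616); it does not attempt to reproduce the reported uncertainties, whose propagation method is not described here.

```python
def screen_parameters(assay_sensitivity, observed, saturation, completeness):
    """Combine the framework parameters into overall sensitivity and completion."""
    sampling_sensitivity = observed / saturation
    overall_sensitivity = assay_sensitivity * sampling_sensitivity
    overall_completion = overall_sensitivity * completeness
    return sampling_sensitivity, overall_sensitivity, overall_completion


sampling, overall, completion = screen_parameters(
    assay_sensitivity=0.207,  # 19 of 92 reference pairs detected
    observed=529,             # interactions found after five repeats
    saturation=616,           # estimated saturation of the screen
    completeness=0.902,       # tested fraction of the search space
)
print(f"sampling sensitivity: {sampling:.1%}")
print(f"overall sensitivity:  {overall:.1%}")
print(f"overall completion:   {completion:.1%}")
```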


Luciferase validation assay 

Protein expression. We expressed proteins constituting PRS_PhI/RRS_PhI
pairs and the interaction pairs from the different subsets (see text) in
cell-free coupled transcription-translation wheat germ lysate (Promega,
L3260) using SP6 promoters. Of each protein pair, one partner was 
expressed with an amino-terminal Flag tag, and the second carried an 
N-terminal Renilla luciferase fusion. Protein pairs were coexpressed 
according to the manufacturer’s protocol (Promega), except that 
the amounts were proportionally adjusted to a final reaction volume 
of 20 µl. Input DNA plasmids were isolated from 1 ml bacterial cultures
grown in Terrific Broth for 20 h on a vibration platform shaker
(Union Scientific) using a Qiagen Biorobot3000 and Turbo Prep 96-well
plasmid isolation kits. These yielded approximately 20-40 ng µl⁻¹ of
DNA, of which 4 µl were used in a 20 µl (final volume) TnT reaction.
Protein expression was carried out by incubating the reaction mixture 
containing both plasmids for 2 h at 30 °C. 


Preparation of plates for immunoprecipitation. Plates coated with 
anti-Flag antibodies were made in-house by incubating white 96-well 
Lumitrac high binding plates (Greiner) overnight at 4 °C with 75 µl
phosphate-buffered saline (PBS; pH 7.4) per well containing 8 µg ml⁻¹
M2 anti-Flag antibody (Sigma). Two hours before use, the antibody
solution was replaced with 100 µl blocking buffer containing 10 µg µl⁻¹
bovine serum albumin (BSA), followed by two hours of shaking at room 
temperature. 


Interaction detection. Following protein expression, 2 µl of lysate
were diluted in 28 µl PBS (pH 7.4); expression of the prey protein was
quantified by adding 10 µl Renilla glow luciferase substrate. The remaining
expression lysate was diluted in 42 µl blocking buffer and added to
the empty wells of the immunoprecipitation plates. The plates were
incubated with gentle shaking for 2 h at 4 °C, then washed three times
with 100 µl blocking buffer. Coimmunoprecipitation efficiency was
determined by adding 10 µl Renilla glow luciferase substrate (Promega)
diluted in 30 µl PBS (pH 7.4). Interaction pairs were scored as positive
when the expression level was at least 10% of the median of the respective
plate (expression positive), the immunoprecipitation exceeded
the median immunoprecipitation of the plate (minimum immunoprecipitation
signal), and the Z-test of the immunoprecipitation efficiency
gave a score greater than 0.4 (immunoprecipitation ratio of the sample
relative to that of the plate). To determine data set precision, we tested
a total of 446 pairs from PRS_PhI (78), PRS_unc (69), RRS_PhI (83), PhI_MAIN (115),
PhI_xz (110) and PhI_RepTF (60). Data set differences were statistically
compared using a one-sided Fisher’s exact test. 
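
For readers unfamiliar with the test, a one-sided Fisher's exact test on a 2 x 2 table of positive and negative validation calls can be sketched as follows. The function is a plain hypergeometric tail sum, not the authors' code, and the counts in the usage lines are made-up placeholders, not results from this study.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """P(X >= a) for the 2x2 table [[a, b], [c, d]] under the hypergeometric null.

    a, b: positive and negative calls in the data set of interest
    c, d: positive and negative calls in the comparison set
    """
    row1 = a + b
    col1 = a + c
    n = a + b + c + d
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1))
    return tail / comb(n, col1)


# Placeholder counts for illustration only.
p_value = fisher_one_sided(a=12, b=8, c=3, d=17)
print(f"one-sided P = {p_value:.4f}")
```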


Network topology 

To determine the network topology of PhI_MAIN, we calculated the distributions
of degree and clustering coefficients for the indicated networks
using the igraph package. We used the distributions to determine the 
underlying network topology™. 
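
The two quantities named above can be illustrated without igraph. The sketch below computes a degree distribution and local clustering coefficients for an undirected edge list; the toy edges are hypothetical, and returning 0.0 for nodes with fewer than two neighbours is a convention chosen here, not necessarily the one used in the analysis.

```python
from collections import Counter
from itertools import combinations


def build_adjacency(edges):
    """Undirected adjacency sets from an edge list, ignoring self-loops."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def degree_distribution(adj):
    """Map each degree to the number of nodes that have it."""
    return Counter(len(neighbours) for neighbours in adj.values())


def local_clustering(adj, node):
    """Fraction of neighbour pairs of `node` that are themselves connected."""
    neighbours = adj[node]
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(neighbours, 2) if b in adj[a])
    return 2 * links / (k * (k - 1))


# Hypothetical toy network.
toy_edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
toy_adj = build_adjacency(toy_edges)
print(dict(degree_distribution(toy_adj)))
print({n: round(local_clustering(toy_adj, n), 3) for n in sorted(toy_adj)})
```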


Network visualization and annotation 

Networks were visualized with Cytoscape® (version 3.7.2) using protein 
annotations from Araport11 (ref. °°). Hormone annotations were downloaded
from AHD2.0, and extracted from The Arabidopsis Information
Resource (TAIR10) gene ontology annotations (downloaded on 3
August 2018). Hormone annotations were inferred from gene ontology
annotations when a gene had a gene ontology term that contained one 
of these key words: auxin, abscisic acid, brassinosteroid, cytokinin, 
ethylene, gibberellin, jasmonic acid, salicylic acid, strigolactone or kar- 
rikin. Gene ontology annotations with evidence code IEP were excluded 
from all analyses. 
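
The keyword rule described above can be sketched as a simple filter. The record layout (gene, term name, evidence code) and the example records are assumptions made for illustration; real gene ontology annotation files carry more fields.

```python
HORMONE_KEYWORDS = (
    "auxin", "abscisic acid", "brassinosteroid", "cytokinin", "ethylene",
    "gibberellin", "jasmonic acid", "salicylic acid", "strigolactone", "karrikin",
)


def infer_hormone_annotations(go_records):
    """Map each gene to the hormone keywords found in its GO term names.

    go_records: iterable of (gene, go_term_name, evidence_code) tuples.
    Records with evidence code IEP are skipped, as stated in the text.
    """
    annotations = {}
    for gene, term_name, evidence in go_records:
        if evidence == "IEP":
            continue
        lowered = term_name.lower()
        for keyword in HORMONE_KEYWORDS:
            if keyword in lowered:
                annotations.setdefault(gene, set()).add(keyword)
    return annotations


# Hypothetical records for illustration.
records = [
    ("geneA", "response to auxin", "IDA"),
    ("geneA", "response to abscisic acid", "IMP"),
    ("geneB", "response to ethylene", "IEP"),   # excluded: IEP evidence
    ("geneC", "cell wall organization", "IDA"),  # no hormone keyword
]
print(infer_hormone_annotations(records))
```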


Community detection 
Communities in PhI_MAIN were determined using the edge-betweenness
algorithm implemented in the R package igraph (version 1.2.4)”. 


Hormone enrichment 

Communities were tested for enrichment with proteins that function in
hormone signalling pathways by using the hormone annotations from
AHD2.0 and TAIR10. For each community, we compared the number of
proteins with a given pathway annotation to the total in the full PhI_MAIN
network using a two-sided Fisher’s exact test, corrected for multiple
hypothesis testing with the Benjamini-Hochberg procedure.
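
The Benjamini-Hochberg adjustment named above is short enough to write out. This is a generic sketch of the standard step-up procedure, not the authors' implementation; the input P values are arbitrary example numbers.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Arbitrary example P values.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```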


Gene ontology enrichment 

All communities were tested for gene ontology enrichment using 
the R package GOstats (version 2.50.0)*8. Gene ontology annotation 
data were derived from the R package GO.db (version 3.7.0). Communi- 
ties were tested for overrepresentation of gene ontology terms by using 
a hypergeometric test function, hyperGTest, invoked with parameter 
conditional = TRUE. P values of each community were corrected for 
testing multiple gene ontology terms using the Benjamini-Hochberg 
method. 


Pathway distance calculation 

To determine the distance between different hormone pathways, we 
determined all shortest paths between proteins of the respective hor- 
mone signalling pathways. We considered only those shortest paths 
that do not contain proteins in the same pathways as those under con- 
sideration. We calculated the mean path length from all shortest paths 
between the two pathways. 
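
One possible reading of this procedure is sketched below. It assumes that a pair of proteins contributes only if at least one of its shortest paths has no intermediate protein annotated to either pathway, that proteins annotated to both pathways are left out, and that the mean is taken over protein pairs rather than over individual paths. All three are assumptions; the function names and the toy network are hypothetical.

```python
from collections import deque


def bfs_distances(adj, source, blocked=frozenset()):
    """Hop distances from source; nodes in `blocked` can be reached but not passed through."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node != source and node in blocked:
            continue  # do not extend paths through pathway members
        for neighbour in adj.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def mean_pathway_distance(adj, pathway_a, pathway_b):
    """Mean shortest-path length between two pathways, skipping same-pathway detours."""
    only_a = set(pathway_a) - set(pathway_b)
    only_b = set(pathway_b) - set(pathway_a)
    members = set(pathway_a) | set(pathway_b)
    lengths = []
    for src in only_a:
        free = bfs_distances(adj, src)
        restricted = bfs_distances(adj, src, blocked=members)
        for dst in only_b:
            # Keep the pair only if a true shortest path avoids pathway members.
            if dst in free and restricted.get(dst) == free[dst]:
                lengths.append(free[dst])
    return sum(lengths) / len(lengths) if lengths else None


# Hypothetical toy network: A1 and A2 in one pathway, B1 and B2 in another.
toy = {
    "A1": {"X", "A2"}, "X": {"A1", "B1"}, "B1": {"X"},
    "A2": {"A1", "B2"}, "B2": {"A2"},
}
print(mean_pathway_distance(toy, {"A1", "A2"}, {"B1", "B2"}))
```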


PCP determination and network comparison 

We used hormone pathway annotations from AHD2.0 and Gene Ontology
for this analysis. From the PhI_MAIN network, we extracted interactions
between two proteins annotated with distinct hormone signalling
pathways (type I), and interactions between two proteins involved in
distinct but also common pathways (type II). To compare the number
of PCPs in PhI_MAIN with LCI networks, we used a subsampling bootstrapping
approach. From each network we conducted 1,000 iterations of
sampling 100 interactions without replacement. For each sampling, we
determined the total number of type I and type II PCPs and the number of
PCPs for each specific hormone combination. We compared the derived
distributions for total PCPs from PhI_MAIN with the distributions obtained
from LCI networks using a two-sided Welch’s two-sample t-test. We
compared the distributions of hormone-combination-specific PCPs
using a two-sided Wilcoxon’s test, corrected for multiple testing by
the number of hormone combinations tested (45).
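
The subsampling step can be sketched as follows. The classification rule is an assumed reading of the two PCP types (type I: both partners annotated, no shared pathway; type II: some shared pathway but non-identical annotation sets), and the network and annotations in the usage lines are hypothetical.

```python
import random


def classify_pcp(paths_a, paths_b):
    """Return 'I', 'II' or None for a pair of pathway-annotation sets."""
    if not paths_a or not paths_b:
        return None
    if not (paths_a & paths_b):
        return "I"       # distinct pathways only
    if paths_a != paths_b:
        return "II"      # shared plus distinct pathways
    return None


def subsample_pcp_counts(interactions, annotations, n_iter=1000, sample_size=100, seed=0):
    """Total PCP count in each of n_iter subsamples drawn without replacement."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_iter):
        sample = rng.sample(interactions, sample_size)
        total = sum(
            1 for a, b in sample
            if classify_pcp(annotations.get(a, set()), annotations.get(b, set())) is not None
        )
        counts.append(total)
    return counts


# Hypothetical toy data.
toy_annotations = {"p1": {"auxin"}, "p2": {"ethylene"}, "p3": {"auxin", "cytokinin"}}
toy_interactions = [("p1", "p2"), ("p1", "p3"), ("p2", "p3"), ("p1", "p4")]
print(subsample_pcp_counts(toy_interactions, toy_annotations, n_iter=5, sample_size=3))
```

The resulting count distributions from two networks could then be compared with a two-sample test, as described in the text.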


Literature curated interactions 

Interactions curated from literature were downloaded from IntAct® 
and BioGRID*. Arabidopsis protein-protein interactions were 
extracted from the IntAct database (downloaded in June 2016) and 
from the BioGRID database version 3.4.142 (downloaded in November 
2016). 


Phytohormone sources 

Phytohormones were obtained from the following manufacturers: 
ACC, Sigma (catalogue number A-3903); 6-benzylamino purine, Sigma 
(B3408); brassinolide, Sigma (B1439); karrikin2 (KAR2), Olchemim 
(025 682); karrikin2 (KAR2), Toronto Research Chemicals (F864800) 
(for Y2H experiments); gibberellic acid 3, Duchefa (G0907); rac-GR24,
Chiralix (CX23880); IAA, Sigma (I2886); paclobutrazol (Pac), Duchefa
(P0922); salicylic acid, Sigma (S5922); ABA, Sigma (A1049); and
methyl-jasmonate (Me-JA), Sigma (392707). 


Plant material and growth conditions 

All A. thaliana lines—wild-type, ahp2, as1, bee1, bee2, bim1, bpm3, 
cbl9, cos1, cpk1, ddl, eds1, ga3ox1, gai, gi, hub1, ibr5, jaz1, jaz3, kai2-2,
myb77, myc2, nap1;1, nia2, pks1, pp2aa2-2, pp2ca, rcar1, rcn1, rgl1, tt4,
ttl, wrky54, rga, rga-28, spy and ein3—are in the Col genetic background. 
Seeds were obtained from the Nottingham Arabidopsis Stock Centre 
(NASC) and propagated for three generations in a greenhouse environment
at 21 °C and LD light (16 h/8 h). For genotyping, one leaf of a
12-14-day-old plant was frozen in liquid nitrogen, and genomic DNA 
was extracted in 1.5 ml tubes using Edwards DNA-extraction buffer*®. 
For expression-level analysis of mutant lines, RNA was extracted using 
a NucleoSpin RNA kit from Macherey-Nagel and the Moloney murine 
leukemia virus (M-MuLV) reverse transcriptase (Biozym 350400201) 
according to the manufacturer’s recommendations. All seeds were 
surface sterilized and stratified for 3 days at 4 °C in the dark on MS 
plates or on plates containing the indicated additives. LD light conditions
were 75-85 µmol m⁻² s⁻¹ measured with a LI-250A light sensor
(LI-COR). Nicotiana benthamiana seeds were spread on soil and grown
in a greenhouse environment at 23 °C and with LD light (16 h/8 h). For
all assays, measurements were carried out with distinct samples (no 
repeat measurements on the same sample). For statistical tests of significance,
we assumed a normal distribution of the measured variable
(for example, root length); hormone treatments and genotype were 
tested as covariates. 


Measurement of triple response to ethylene 

Sterile seeds were placed directly on standard MS plates or on plates 
with 10 µM ACC, then stratified for 3 days at 4 °C in the dark, transferred
into light for 1 h to induce germination, and incubated for 3 days at 23 °C
in the dark. The formation of apical hooks versus loops was scored 
visually; image analysis for hypocotyl and root-length determination 
was performed using Fiji imaging software*! and the Simple Neurite 
Tracer” plugin (version 3.1.3). 


Root elongation measurements 

Seedlings were grown on MS plates to five days after germination, 
and then transferred to MS mock plates or MS containing the 
appropriate phytohormone additive as indicated in the figures (Pac, 
0.5 µM, 1.0 µM; brassinolide, 0.1 µM, 0.5 µM; Me-JA, 25 µM). Transferred
seedlings were grown in the vertical position for another four
days at 23 °C in LD light conditions (16 h/8 h). Root lengths were determined
as above.


Anthocyanin accumulation 

Anthocyanin content in response to the indicated treatments was 
determined as described in ref. * and expressed per gram of fresh 
weight. 


Root hair growth 

Root hair growth was analysed as in ref. °° using 1 µM KAR2. Arabidopsis
seeds were stratified in the dark for 3 days at 4 °C and then transferred
to a growth cabinet at 22 °C with a 16 h/8 h light/dark cycle (intensity
approximately 100 µmol m⁻² s⁻¹). Images were taken with a Zeiss SteREO
Discovery.V8 microscope (Carl Zeiss) equipped with a Zeiss Axiocam 
503 colour camera (Carl Zeiss). We determined the number of root 
hairs by counting root hairs with lengths of 2-3 mm (from the root tip) 
on each root, and root hair length was measured for 10-12 different 
root hairs per root as described above. For karrikin treatments, KAR2 
(Olchemim) was dissolved in 75% methanol to prepare a 10 mM stock
solution. Analysis and data were based on two repeats. 


Infection assay 

To measure bacterial proliferation in 4-5-week-old plants, we carried 
out assays as described“ using Pseudomonas syringae pv. tomato 
DC3000. To prepare the inoculum, bacteria were grown overnight on 
NYGA medium (5 g l⁻¹ bactopeptone, 3 g l⁻¹ yeast extract and 20 ml l⁻¹
glycerol) and resuspended and diluted to 1 x 10° colony-forming units
per ml in 10 mM MgCl₂. Bacteria were inoculated by syringe infiltration
of two leaves per plant, and harvested at four days post inoculation as
described“. In short, three leaf discs per sample were incubated for 1 h
in 10 mM MgCl₂ containing 0.01% Silwet. The resulting suspension was
serially diluted, 20 µl of each dilution were plated, and colonies were
counted after two days. 


BiFC assay 

For BiFC, we used the vectors pMDC43-YFC, pMDC43-YFN* and 
pDEST-VYNE(R), pDEST-VYCE(R)**. After Gateway recombination, the 
ORF-containing destination clones were introduced into Agrobacte- 
rium tumefaciens strain GV3101. Transformed A. tumefaciens cells were 
grown overnight and resuspended in infiltration buffer (10 mM MgCl₂,
10 mM 2-(N-morpholino)ethanesulfonic acid (MES) pH 5.6 and 150 µM
acetosyringone) with a final optical density at 600 nm (OD600) of 0.3
for each expression vector. The abaxial leaf surface of N. benthamiana 
plants was transiently transformed with A. tumefaciens (containing the 
constructs and the silencing inhibitor protein p19) by infiltration using a 
needleless syringe. Two days after infiltration, two leaves from two inde- 
pendently transformed plants were used for fluorescence detection. 
Reconstitution of fluorescence was observed under an epifluorescence 
microscope (Olympus BX61) using yellow fluorescent protein (YFP) and 
red fluorescent protein (RFP) bandpass filters for the YFC-MYC2 and 
YFN-CIPK14 interactions, respectively; either a TCS SP8 multiphoton 
microscope (Leica) or an LSM880 laser scanning confocal microscope 
(Carl Zeiss) was used for the remaining BiFC assays. The laser excita- 
tion wavelength for both microscopes was 488 nm and the detection 
band was set to 493-545 nm for Venus protein. The objectives were a PL 
APO x40/1.10 and a Plan-Apochromat x20/0.8 M27 for the TCS SP8 and 
LSM880, respectively. Image analysis was performed using Fiji imaging
software“. Analyses were performed in duplicate for all constructs. 


In vitro pulldown assays 

For in vitro pulldown assays, amylose resin (New England Biolabs) 
coated with MBP-MYC2 was incubated for 2 h at 4 °C with an equimolar
amount of purified glutathione S-transferase (GST)-CIPK14. Wash
and elution steps were performed according to the manufacturer’s 
instructions. Pulldowns were analysed by western blot using antibod- 
ies against GST (Amersham Biosciences) and maltose-binding protein 
(MBP, New England Biolabs). 
Protein-protein interaction likely scores

We developed an edge-score model to determine the protein-protein 
interaction likely score in different plant tissues and development 
states. The edge-score modelling was designed to exploit transcript 
abundance to estimate the possibility, and to some extent the likeli- 
hood, of an interaction taking place in a given tissue and condition. It 
is based on using transcript abundance as a proxy for protein concen- 
tration and modelling binary complex formation by the law of mass 
action. Tissue-specific transcriptome data were collected from ref. *””. 
FastQC (version 0.11.7) was used for read quality control before and 
after trimming. Adaptor sequences and low-quality reads were trimmed 
with Trimmomatic version 0.36 (ref. *°) using the ILLUMINACLIP: 
TruSeq3-SE.fa:2:30:10, LEADING:3, TRAILING:3, SLIDINGWINDOW:4:15
and MINLEN:36 options. High-quality reads were mapped to the TAIR10
reference genome. The estimation of gene abundance was performed 
with Kallisto version 0.45 (ref. *”). To estimate the chance of two proteins,
i and j, interacting in a given condition, we used the law of mass
action to obtain a quantitative estimate of their interaction feasibility.
We estimated the amount of proteins i and j using their respective transcript
levels, t_i and t_j, as proxies, and determined edge scores as follows.
In each tissue, let t_i^{t_k} and t_j^{t_k} denote the abundance of genes i and j in
tissue t_k. The score of the interaction between proteins i and j in tissue
t_k is calculated as:
After obtaining a score for each interaction in each tissue, we compute
the raw edge score (es’) of a specific interaction in tissue t_k by
Z-transformation:
Finally, we normalize this score to the range of [0, 1]:

$$es_{ij}^{t_k} = \frac{es'^{\,t_k}_{ij} - \min(es'_{ij})}{\max(es'_{ij}) - \min(es'_{ij})}$$

A higher normalized edge score indicates that an interaction in this
tissue is more likely, as both proteins are expressed jointly.
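
The three steps can be illustrated with a small sketch. Because the first two formulas are not reproduced in this text, the sketch makes explicit assumptions: the mass-action score is taken as the product of the two transcript abundances, and the Z-transformation and min-max scaling are applied across tissues for one interaction. The abundance values are invented for illustration.

```python
from statistics import mean, pstdev


def edge_scores(abundance_i, abundance_j):
    """Normalized edge score per tissue for one protein pair.

    abundance_i, abundance_j: dicts mapping tissue name to transcript abundance.
    Assumed steps: product of abundances, Z-transformation across tissues,
    then min-max scaling to [0, 1].
    """
    tissues = sorted(set(abundance_i) & set(abundance_j))
    raw = {t: abundance_i[t] * abundance_j[t] for t in tissues}
    mu = mean(raw.values())
    sd = pstdev(raw.values())
    if sd == 0:
        return {t: 0.0 for t in tissues}  # no variation across tissues
    z = {t: (raw[t] - mu) / sd for t in tissues}
    lowest, highest = min(z.values()), max(z.values())
    return {t: (z[t] - lowest) / (highest - lowest) for t in tissues}


# Invented abundances for illustration.
scores = edge_scores(
    {"root": 10, "leaf": 2, "flower": 5},
    {"root": 4, "leaf": 1, "flower": 2},
)
print({t: round(s, 3) for t, s in scores.items()})
```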


Reporting summary 


Further information on research design is available in the Nature 
Research Reporting Summary linked to this paper. 


Data availability 


All functional, genetic and interaction data generated here are available 
as Supplementary Information. The genes selected for interactome 
mapping (the search space) are presented in Supplementary Table 1. 
All protein-protein interaction data can be found in Supplementary 
Table 2. The data for genetic validation assays can be found in Sup- 
plementary Table 5. The preliminary edge scores for all interactions 
identified here are in Supplementary Table 6. Additionally, all protein
interactions identified here have been submitted to IMEx (http://www.imexconsortium.org)
through IntAct? with identification code
IM-27834. Source data are provided with this paper. 


Code availability 


Custom scripts used here are available at
https://github.com/INET-HMGU/PhyHormInteractome.
ddl, rcn1 and ttl (candidate) lines. d, Gibberellic acid (GA)-mediated inhibition
Extended Data Fig. 5 | Hormone-response assays II. a, Salicylic acid
(SA)-associated phenotypes, showing Pst titres four days after leaves were
inoculated with Pst by syringe infiltration. In planta Pst titres were elevated in
mature plants of the indicated genotypes relative to wild-type Col-0 plants.
b, Jasmonate (JA)-induced root growth in the absence (mock) or presence of
25 µM Me-JA. c-f, Ethylene-induced triple response in control conditions
compared with Col-0 plants. c, Apical hook formation in the absence or
presence of 10 µM ACC. Shown are representative results underlying the
quantitation in d. Scale bars, 5 mm. d, Proportion of apical loop formation
(rather than hook formation) following treatment with 10 µM ACC, for the same
lines as in c. e, Hypocotyl length in the absence or presence of 10 µM ACC for
the same lines as in d. f, Root elongation in the absence or presence of 10 µM
ACC for same lines as in d. Two-sided t-test; *P < 0.05, **P < 0.01, ***P < 0.001.
b, e, f, Boxes represent IQRs; black lines represent medians; whiskers indicate
highest and lowest data points within 1.5 IQRs; outliers are plotted individually.
a, b, d-f, Two-sided t-test; *P < 0.05, **P < 0.01, ***P < 0.001. Precise n values for
each repeat and exact P values are provided in Supplementary Table 5.
Extended Data Fig. 6 | Ethylene-induced triple response assays (negative
controls). The ethylene-mediated triple response in negative control lines is
compared with that in Col-0 and ein3 lines. a, Proportion of apical loop
formation in response to 10 µM ACC. b, Hypocotyl length in the absence or
presence of 10 µM ACC. c, Root elongation in the absence or presence of
10 µM ACC. Two-sided t-test; *P < 0.05, **P < 0.01. b, c, Boxes represent IQRs;
black lines represent medians; whiskers indicate highest and lowest data points
within 1.5 IQRs; outliers are plotted individually. Precise n values for each
repeat and P values are in Supplementary Table 5.
Extended Data Fig. 7 | PCP validation. a, Summary of hormone-assay results
for 27 candidate genes. Light colours indicate previously known hormone
pathway annotations. Bright colours indicate significant (Fig. 2, Extended Data
Figs. 4-7 and Supplementary Table 5) new phenotypes observed in validation
assays. b, BiFC analysis in N. benthamiana of two PCP_I pairs (AHP2-MYC2,
MYB77-RCAR1) and five PCP_II pairs (CBL9-IBR5, PP2CA-IBR5, TT4-COS1,
AS1-NIA2, EDS1-HUB1). PCP pairs were also tested with one or two negative
controls in the BiFC assay. Each construct was tested in duplicate and in two
independent assays; one representative result is shown. Scale bars, 10 µm.
representative set of Y2H results, out of four repeats, showing yeast growth on
selective media in the presence or absence of 30 µM ABA. All candidate
interactors identified in primary screens were tested systematically against all
receptors in the shown representative verification experiments. g, Plate layout
of candidate interactors tested with the indicated RCARs in b-f.
Extended Data Fig. 9 | Hormone-dependent Y2H interactions.
Extended Data Fig. 10 | Pathway convergence on transcription factors.
Y2H-derived map showing interactions of repressors and non-DNA-binding
transcriptional regulators (boxed and colour-coded for involvement in the
relevant main pathway) with Arabidopsis transcription factors. Above the
repressors/regulators are transcription factors that interact specifically with
regulators from one pathway. Lower layers show transcription factors that
interact with regulators from several pathways. Node annotations are
represented by the indicated colour codes.
Plant hormones known as strigolactones control plant development and interactions
between host plants and symbiotic fungi or parasitic weeds’ *. In Arabidopsis thaliana
and rice, the proteins DWARF14 (D14), MORE AXILLARY GROWTH 2 (MAX2),
SUPPRESSOR OF MAX2-LIKE 6, 7 and 8 (SMXL6, SMXL7 and SMXL8) and their
orthologues form a complex upon strigolactone perception and play a central part in
strigolactone signalling* ©. However, whether and how strigolactones activate 
downstream transcription remains largely unknown. Here we use a synthetic 
strigolactone to identify 401 strigolactone-responsive genes in Arabidopsis, and show 
that these plant hormones regulate shoot branching, leaf shape and anthocyanin 
accumulation mainly through transcriptional activation of the BRANCHED 1, TCP 
DOMAIN PROTEIN 1 and PRODUCTION OF ANTHOCYANIN PIGMENT 1 genes. We find 
that SMXL6 targets 729 genes in the Arabidopsis genome and represses the 
transcription of SMXL6, SMXL7 and SMXL8 by binding directly to their promoters,
showing that SMXL6 serves as an autoregulated transcription factor to maintain the 
homeostasis of strigolactone signalling. These findings reveal an unanticipated 
mechanism through which a transcriptional repressor of hormone signalling can 
directly recognize DNA and regulate transcription in higher plants. 


Strigolactones are a class of carotenoid-derived plant hormones that
have fundamental effects on shoot branching, leaf development, plant
height, anthocyanin accumulation, root architecture, and adaptation
to drought and phosphate starvation. The D14 protein in Arabidopsis,
rice or pea and orthologues of the KARRIKIN INSENSITIVE2 (KAI2)
protein in Striga hermonthica have been identified as strigolactone
receptors, which hydrolyse strigolactones, bind covalently to
intermediate molecules, and form a complex with the F-box protein
D3 and the transcriptional repressor D53. D3 can adopt two
conformations to regulate the hydrolysis activity of D14 and to mediate
the ubiquitination and degradation of D53. D53 and its orthologues
SMXL6, SMXL7 and SMXL8 (SMXL6, 7, 8 hereafter) in Arabidopsis
have been proposed to be transcriptional repressors that function by
interacting with transcription factors and with the transcriptional
corepressor proteins TOPLESS (TPL) and TPL RELATED (TPR).
Enormous efforts have been made to characterize
strigolactone-responsive genes, but only a few have been identified so far, including
BRANCHED 1 (BRC1) in Arabidopsis and pea, MAX3, MAX4,
SMXL2, 6, 7, 8, SALT TOLERANCE HOMOLOG 7/BZR1-1D SUPPRESSOR 1
(STH7/BZS1), DWARF14-LIKE2 (DLK2), KAR-UP F-BOX1 (KUF1) and several
auxin-responsive genes in Arabidopsis, and D53 and CYTOKININ
OXIDASE/DEHYDROGENASE9 (OsCKX9) in rice. These modest
transcriptional changes cannot explain how strigolactones regulate diverse
aspects of plant development and responses to various environmental
signals. Furthermore, the synthetic strigolactone analogue rac-GR24,
widely used experimentally, comprises a pair of enantiomers, GR24^5DS
and GR24^ent-5DS (Extended Data Fig. 1a), and stimulates both the
strigolactone and the karrikin signalling pathways. It is therefore crucial
to explore strigolactone-responsive genes by using chemicals that
specifically stimulate strigolactone signalling with high efficiency.

Here we report the identification of key strigolactone-responsive 
genes that specifically regulate various developmental processes such 
as shoot branching, leaf development, anthocyanin accumulation, and 
drought adaptation. More importantly, we further show that SMXL6
can directly bind to the promoters of SMXL6, 7, 8 and regulate their 
transcription, thus functioning as a transcription factor in strigolac- 
tone signalling. 


Identifying strigolactone-responsive genes 

To identify genes that respond specifically to strigolactones in
Arabidopsis, we compared the effects of rac-GR24 with those of synthetic
GR24^5DS, GR24^4DO and GR24^ent-4DO (Extended Data Fig. 1a). We found that
GR24^4DO could trigger the degradation of SMXL6 labelled with green
fluorescent protein (GFP), but that the enantiomer GR24^ent-4DO had
few effects (Extended Data Fig. 1b). Although 5DS stereoisomers are
more effective than 4DO stereoisomers in triggering the strigolactone
biosensor StrigoQuant, GR24^5DS can activate both strigolactone and
karrikin signalling through D14 and KAI2 (ref. 7). We found that GR24^4DO
promoted the expression of SMXL6, 7, 8 and BRC1 with higher efficiency
than GR24^5DS in a D14-dependent manner (Fig. 1a, b and Extended Data
Fig. 1c, d), indicating that GR24^4DO specifically induces transcriptional
responses through D14 and is an ideal chemical for experimental
stimulation of strigolactone signalling.

Fig. 1 | GR24^4DO efficiently stimulates strigolactone signalling. a, Expression
of SMXL7 and BRC1 in 10-day-old wild-type seedlings treated with 5 μM
rac-GR24, GR24^5DS or GR24^4DO. Data are means ± s.e.m.; n = 3 biologically
independent samples. b, Induction of SMXL7 and BRC1 upon GR24^4DO
treatment was blocked by D14 mutation (d14 plants). Data are means ± s.e.m.;
n = 3 biologically independent samples. Col-0 plants are wild type. c, Overview
of genes upregulated or downregulated upon treatment with 5 μM GR24^4DO for
2 h or 4 h in three RNA-seq replicates. a, b, P values are shown; two-sided
Student's t-test.

We further conducted RNA-sequencing (RNA-seq) analyses upon
GR24^4DO treatment, and identified 99 upregulated and 57 downregulated
genes after 2 hours, as well as 147 upregulated and 150 downregulated
genes after 4 hours (Fig. 1c and Supplementary Table 1). These
differentially expressed genes (DEGs) cover nearly a quarter of the
genes that have previously been reported to be responsive to GR24^5DS
and rac-GR24 (Supplementary Table 1). More importantly, roughly
90% of DEGs were newly found (Supplementary Table 1) and could be
verified by reverse transcription with quantitative polymerase chain
reaction (RT-qPCR; Extended Data Fig. 2), indicating a high efficiency
of GR24^4DO in stimulating strigolactone signalling in Arabidopsis.

The most highly enriched gene ontology (GO) terms in the DEGs
upregulated at 2 hours were involved in microtubule function (Supplementary
Table 2), suggesting a potential regulation of the cytoskeleton by
strigolactone signalling. Early auxin-inducible genes were enriched in the
2-hour downregulated DEGs (Supplementary Table 2 and Extended Data
Fig. 2a), consistent with the previous finding that most of the
rac-GR24-repressed genes are induced by auxins. Upon 4 hours of treatment,
expression of the auxin-biosynthesis genes YUCCA 3 (YUC3) and YUC5
was downregulated (Supplementary Table 1 and Extended Data Fig. 2b).
Strigolactones have previously been found to induce clathrin-mediated
endocytosis of PIN-FORMED 1 (PIN1) to deplete membrane-localized
PIN1 and repress polar auxin transport in the stem, while auxin
promotes expression of the strigolactone-biosynthesis genes MAX3 and
MAX4. These results indicate complex crosstalk between
strigolactones and auxin, both of which are important in plant architecture.

We found that GR24^4DO induces the expression of At14a-LIKE 1
(AFL1; Extended Data Figs. 2c, 3a and Supplementary Table 1), which
is required for drought tolerance in Arabidopsis. More importantly,
the smxl6, smxl7, smxl8 triple mutant (s678 hereafter) displayed strong
drought tolerance and elevated AFL1 expression (Extended Data Fig. 3b, c).
Several karrikin-inducible genes were also promoted by GR24^4DO
(Extended Data Fig. 2d and Supplementary Tables 1, 2), suggesting
crosstalk between the strigolactone and karrikin signalling pathways.
In addition, analysis of the Kyoto Encyclopedia of Genes and Genomes
(KEGG) showed that carotenoid and flavonoid biosynthesis pathways
were enriched in 2-hour and 4-hour upregulated DEGs, respectively
(Supplementary Table 2 and Extended Data Fig. 2e).

To investigate how strigolactones control plant development 
through transcriptional regulation of responsive genes, we system- 
atically analysed DEGs that encode transcription factors, finding 24 
upregulated and 14 downregulated genes (Extended Data Fig. 2f, g 
and Supplementary Table 3). We then focused on BRC1, TCP1 and
PRODUCTION OF ANTHOCYANIN PIGMENT 1 (PAP1) to study their roles in
strigolactone-regulated plant development. 


Strigolactones, BRC1 and shoot branching 


SMXL6 represses transcription by interacting with TPL/TPR proteins
in a manner that depends on its
ethylene-responsive-element-binding-factor-associated amphiphilic repression (EAR) motif. We
found that wild-type SMXL6 could rescue the shoot-branching phenotype
of the s678 triple mutant, whereas the EAR-deleted SMXL6 (SMXL6^ΔEAR)
could not (Extended Data Fig. 4a, b), indicating that transcriptional
regulation through the EAR motif is required for strigolactone-mediated
shoot branching. BRC1 is a key regulator that represses bud outgrowth
and functions as a signal integrator in numerous pathways. In
unelongated axillary buds, BRC1 expression is dramatically repressed by
SMXL6 in an EAR-dependent manner (Extended Data Fig. 5a).
The high-branching mutant brc1-6, a null allele of BRC1, completely
suppressed the shoot-branching phenotypes of s678 (Extended Data
Fig. 5b-f), indicating an essential role of BRC1 in strigolactone-inhibited
shoot branching. Furthermore, GR24^4DO induced the expression of
HB40, a BRC1-target gene that regulates the biosynthesis of abscisic acid
(ABA), in a manner dependent on D14 and BRC1 (Fig. 2a and Extended
Data Fig. 5g). Notably, endogenous ABA levels in unelongated axillary
buds were reduced in the strigolactone-biosynthesis-deficient mutant
max3-9 but elevated in s678 plants, and this elevation could be repressed
by brc1-6 (Fig. 2b, Extended Data Fig. 5h), indicating that strigolactones
elevate ABA levels in buds through transcriptional induction of BRC1.
These results indicate that the EAR motif of SMXL6 is essential for
strigolactone-activated BRC1 expression in axillary buds, which
promotes ABA accumulation and represses bud outgrowth. The shoot
auxin-transport system is also important in strigolactone-regulated
shoot branching, probably in an EAR-independent manner. Thus,
SMXL6, 7 and 8 promote shoot branching through transcriptional
repression of BRC1 and nontranscriptional regulation of auxin transport.


Strigolactones, TCP1 and leaf shape 


Strigolactones promote leaf elongation through SMXL6 in an
EAR-dependent manner (Extended Data Fig. 4c, d). However, the leaf
shape of the s678 brc1-6 quadruple mutant was similar to that of the
s678 mutants (Extended Data Fig. 6a), raising the possibility that
strigolactones also regulate leaf development through other downstream
genes. We found that the expression of TCP1, which regulates leaf
development, was induced upon GR24^4DO treatment in a D14-dependent
manner (Fig. 2c). The TCP1 expression level was greatly repressed in
max3-9 but dramatically increased in s678 and max3-9 s678 mutants
(Fig. 2d), and the transcriptional regulation of TCP1 depended on the
EAR motif of SMXL6 (Extended Data Fig. 6b). Furthermore, overexpression
of TCP1-SRDX, a dominant-negative form of TCP1 with an added
12-amino-acid repressor sequence, resulted in rounder leaves in
both wild-type and s678 backgrounds (Extended Data Fig. 6c). Notably,
although the tcp1-1 null mutant formed leaves similarly to the wild type,
it greatly reduced the ratio of leaf length to width in the s678
background (Fig. 2e and Extended Data Fig. 6d, e), indicating that SMXL6, 7, 8
suppress leaf elongation partially by repressing TCP1 expression.
Therefore, TCP1 makes an important contribution to leaf-shape regulation
in strigolactone signalling.

Fig. 2 | BRC1 and TCP1 regulate shoot branching and leaf shape,
respectively, in strigolactone signalling. a, Expression of HB40 after
treatment with GR24^4DO for 4 h in wild-type (Col-0) and brc1-6 plants. Data are
means ± s.e.m.; n = 3 biologically independent samples. b, ABA content in
unelongated axillary buds of wild-type and the indicated mutant plants. Data
are means ± s.e.m.; n = 4 biologically independent samples. c, Expression of
TCP1 in wild-type and d14-1 seedlings treated with GR24^4DO. Data are
means ± s.e.m.; n = 3 biologically independent samples. d, Expression of TCP1 in
wild-type and mutant seedlings. Data are means ± s.e.m.; n = 3 biologically
independent samples. e, Morphology of the fifth leaves of 3-week-old wild-type
and indicated mutant plants. Data are means ± s.e.m.; n = 12 leaves. a-e, P values
are shown; Tukey's honest significant difference (HSD) test (b, e) or two-sided
Student's t-test (a, c, d).


Strigolactones, PAPs and anthocyanin synthesis


Transcripts of PAP1, PAP2, MYB113 and MYB114, which encode R2R3-MYB
transcription factors and activate anthocyanin biosynthesis, were
greatly induced after GR24^4DO treatment for 2 hours and 4 hours (Fig. 3a
and Extended Data Fig. 7a). Furthermore, after GR24^4DO treatment for
4 hours, expression of the anthocyanin-biosynthesis genes
DIHYDROFLAVONOL 4-REDUCTASE (DFR), TRANSPARENT TESTA 7 (TT7) and
ANTHOCYANIDIN SYNTHASE (ANS), which are activated by PAP1, PAP2,
MYB113 and MYB114, was induced (Fig. 3a and Extended Data
Fig. 2e). Endogenous anthocyanin levels were moderately decreased in
max3-9 but greatly increased in s678 mutants (Extended Data Fig. 7b,
c). Accordingly, expression levels of PAP1, PAP2, MYB113, MYB114 and
DFR were reduced in max3-9 but dramatically increased in s678 and
max3-9 s678 mutants (Extended Data Fig. 7d). Overexpression of SMXL6
repressed the elevated expression of PAP1, PAP2, MYB113, MYB114 and
DFR and anthocyanin accumulation in s678 in an EAR-dependent manner
(Extended Data Fig. 7c, e). Furthermore, the impaired anthocyanin
accumulation seen in max3-9 mutants was completely suppressed by
the pap1-D mutation (Fig. 3b, c and Extended Data Fig. 7f), which causes
constitutive overexpression of PAP1 and overaccumulation of
anthocyanin. More importantly, anthocyanin accumulation in s678 mutants was
partially rescued by the pap2-1 null mutation, and could be completely
rescued by the pap1-2 null mutation or by pap1-2 pap2-1 double mutations
(Fig. 3d and Extended Data Fig. 7g), showing that PAP1 and PAP2
work downstream of SMXL6, 7, 8 in strigolactone-promoted anthocyanin
biosynthesis. The induction of DFR expression upon GR24^4DO treatment
was disrupted in pap1-D mutants (Fig. 3e), indicating an important role
of PAP1 in strigolactone-induced DFR expression. Together, these results
suggest a gene-regulation cascade in which strigolactones induce the
expression of PAP1, PAP2, MYB113 and MYB114 and consequently activate
the transcription of DFR, ANS and TT7 to elevate anthocyanin abundance.

Fig. 3 | Strigolactones promote anthocyanin accumulation by inducing
transcription of PAP1 and anthocyanin-biosynthesis genes. a, Expression
of PAP1 and DFR in wild-type (Col-0) and d14-1 seedlings after GR24^4DO
treatment. Data are means ± s.e.m.; n = 3 biologically independent samples.
b, Anthocyanin accumulation in the stem base of 3-week-old wild-type and
indicated mutant plants. Scale bar, 1 mm. c, d, Anthocyanin content (A535 and
A650 anthocyanins) in 3-week-old wild-type and indicated mutant plants. Data
are means ± s.e.m.; n = 3 (c) or 4 (d) pools (6 seedlings per pool). e, Expression of
DFR in wild-type and pap1-D seedlings after GR24^4DO treatment for 4 h. Data are
means ± s.e.m.; n = 3 biologically independent samples. a, c-e, P values are
shown; two-sided Student's t-test (a, c, e) or Tukey's HSD test (d).


SMXL6, 7, 8 function as transcription factors


To understand how SMXL6 regulates the expression of target genes,
we investigated its global binding profiles through chromatin
immunoprecipitation sequencing (ChIP-seq) assays in the s678 mutant
and in a transgenic line that overexpresses haemagglutinin-tagged
SMXL6 (pSMXL6:SMXL6-HA), which rescues the phenotypes and
gene-expression profiles of s678 (Extended Data Figs. 4, 5a, 6b and 7c, e).
The two biological replicates shared 1,079 peaks, which were highly
enriched in 3-kilobase promoters (59.3%) compared with intergenic
(24.3%) and intragenic (16.4%) regions (Fig. 4a, b and Supplementary
Tables 4, 5). Among the 729 genes targeted by SMXL6-HA in two
replicates, 28 are GR24^4DO-responsive genes, leading to roughly 3.22-fold
enrichment compared with nonspecific binding (Fisher's exact test,
P = 1.93 × 10^-7; Supplementary Tables 6, 7). SMXL6-HA could bind the
promoter regions of SMXL6, 7, 8 and BRC1, but no SMXL6-binding
signals were found in TCP1, PAP1, PAP2, MYB113 or MYB114 (Fig. 4c,
Extended Data Fig. 8a and Supplementary Tables 5, 6). Furthermore,
transcription of luciferase reporter genes driven by the promoters
of SMXL6, 7, 8 and BRC1 was repressed by SMXL6, and deletion of the
EAR motif or GR24^4DO treatment substantially released such
transcriptional repression (Fig. 4d and Extended Data Fig. 8b, c), indicating that
transcriptional regulation of SMXL6 is controlled by strigolactones.
Expression of SMXL6, 7, 8 was dramatically decreased in nonelongated
buds of strigolactone-biosynthetic and -signalling mutants,
indicating that SMXL6, 7, 8 undergo negative feedback regulation
(Extended Data Fig. 8d, e). The finding that a SMXL homologue, HEAT
SHOCK PROTEIN 101 (HSP101), can bind to specific messenger RNAs
prompted us to examine whether SMXLs have potential DNA-binding
activities. Surprisingly, we found that SMXL6 and SMXL7 could bind
directly to the promoters of SMXL6, 7, 8, and that SMXL8 bound directly
to the SMXL7 promoter in vitro, with the shifted band showing a much
stronger signal for the SMXL7 promoter in electrophoretic mobility
shift assays (EMSAs; Fig. 4e and Extended Data Fig. 9a, b). SMXL6 could
directly bind to the P7-3 and P7-3-3 fragments of the SMXL7 promoter,
with either the ATAA to TATT or the CAA to GTT mutation in these
fragments completely disrupting the interaction (Fig. 4f and Extended
Data Fig. 9c, d). Notably, the binding of SMXL6 to the ATAACAA motif
of the SMXL7 promoter had functional consequences: mutations in this
signature greatly disturbed the transcriptional repression activity of
SMXL6 in vivo (Fig. 4g), indicating that the ATAACAA motif is essential
for SMXL6 to directly bind and inhibit SMXL7. Consistent with this
result, ATAACAA is 1.66-fold and 3.31-fold enriched in, respectively,
1-kb and 100-bp flanking sequences around the ChIP-seq peak summits
(P value less than 1 x 10° by χ² test) of GR24^4DO-responsive genes
(Supplementary Table 7) compared with the 1-kb promoter of all genes
in the Arabidopsis genome. In addition, SMXL6 could not directly bind
to the BRC1 promoter in EMSAs (Fig. 4e), suggesting that SMXL6 may
work together with unknown transcription factors to repress BRC1
expression. Together, these results show that the SMXL6, 7, 8 proteins
can bind DNA directly and negatively regulate their own transcription,
functioning as autoregulated transcription factors to maintain
the homeostasis of SMXL6, 7, 8 and to downregulate strigolactone
signalling. Our findings reveal a previously unknown mechanism for
transcriptional regulation in plant hormone signalling pathways.

Fig. 4 | SMXL6 binds directly to the SMXL7 promoter. a, SMXL6-binding sites
in two ChIP-seq replicates. b, Distribution of overlapping peaks bound by
SMXL6 in the Arabidopsis genome. CDS, coding sequence. c, SMXL6 can
associate with the SMXL7 and BRC1 promoters in ChIP-seq (left) and
ChIP-qPCR assays (right). Chromatin from seedlings of S6:S6-HA s678 and s678 plants
was immunoprecipitated with anti-haemagglutinin (HA) polyclonal
antibodies. Red lines below the peak regions represent probes used in
ChIP-qPCR assays (c) and EMSAs (e). In ChIP-qPCR assays, the enrichment of target
gene promoters is displayed as the percentage of input DNA. Data are
means ± s.e.m.; n = 4 biologically independent samples. The promoter of
TUBULIN ALPHA 2 (TUA2) was used as a nonspecific target. d, Transcriptional
activities of SMXL6 and SMXL6^ΔEAR on the SMXL7 (left) and BRC1 (right)
promoters in Arabidopsis protoplasts. Data are normalized to samples
expressing GFP and are means ± s.e.m.; n = 3 biologically independent samples.
LUC, luciferase. e, SMXL6 binds directly to the SMXL6, 7 and 8 promoters but
not the BRC1 promoter in EMSAs. For competition, 50- and 100-fold excess
unlabelled probes (cold probes) were mixed with biotin-labelled probes. GST,
glutathione-S-transferase. Data represent three independent experiments.
f, Fine mapping of the SMXL7 promoter regions bound by SMXL6. We used
20-fold excess unlabelled probes for competition. The 'supershift' band was
detected in the presence of an anti-GST monoclonal antibody. Data represent
four independent experiments. g, Repression by SMXL6 on the SMXL7
promoter depends on the ATAACAA motif. Data are normalized to samples
expressing GFP. Data are means ± s.e.m.; n = 3 biologically independent
samples. c, d, g, P values are shown; two-sided Student's t-test.
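
The roughly 3.22-fold enrichment of strigolactone-responsive genes among SMXL6 targets can be illustrated with a short calculation. The sketch below is not the authors' script: it takes the 28 responsive genes among 729 SMXL6 targets and the 401 responsive genes from the text, and assumes a background of 33,602 annotated loci, a value chosen here only because it reproduces the reported fold enrichment. The tail probability is the one-sided hypergeometric form of Fisher's exact test, which may differ from the variant used in the paper.

```python
from math import comb

def fold_enrichment(overlap, targets, responsive, background):
    """Ratio of the responsive fraction among targets to that in the background."""
    return (overlap / targets) / (responsive / background)

def hypergeom_upper_tail(overlap, targets, responsive, background):
    """P(X >= overlap) when `targets` genes are drawn without replacement from
    `background` genes, of which `responsive` are strigolactone-responsive."""
    upper = min(targets, responsive)
    favourable = sum(
        comb(responsive, k) * comb(background - responsive, targets - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / comb(background, targets)

# Counts from the text; BACKGROUND is an assumption, not a value stated in the paper.
OVERLAP, TARGETS, RESPONSIVE, BACKGROUND = 28, 729, 401, 33602

enrichment = fold_enrichment(OVERLAP, TARGETS, RESPONSIVE, BACKGROUND)
p_value = hypergeom_upper_tail(OVERLAP, TARGETS, RESPONSIVE, BACKGROUND)
print(f"fold enrichment = {enrichment:.2f}, one-sided P = {p_value:.3g}")
```

Exact integer arithmetic is used for the binomial coefficients, so the only rounding occurs in the final division.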


Discussion 


Here we have demonstrated an autoregulation model for strigolactone
signalling (Fig. 5). SMXL6 binds directly to the promoters of SMXL6, 7, 8
and negatively regulates their transcription; strigolactones trigger
the formation of the SMXL6-D14-MAX2 complex and degradation of
SMXL6 through the ubiquitination proteolysis system, thus releasing
the transcriptional inhibition of SMXL6, 7, 8 and forming a negative
feedback loop that is required for the strict modulation of SMXL6, 7, 8
abundance. Meanwhile, SMXL6 can also, together with the TPL/TPR
proteins, function as a transcriptional repressor to regulate shoot
branching, leaf elongation and anthocyanin biosynthesis, mainly
by repressing the transcription of BRC1, TCP1 and PAP1, respectively
(Fig. 5). These findings show a panorama of transcriptional regulation
in strigolactone signalling. Further identifying and investigating the
transcription factors associated with SMXL6 and their modulation
of early-responsive genes will uncover how strigolactones regulate
diverse aspects of plant development and symbiotic relationships
with arbuscular mycorrhiza fungi.

Fig. 5 | Proposed model of transcriptional regulation by SMXL6. In the
absence of strigolactones (SLs; top), SMXL6 plus TPL bind directly to the
promoters of SMXL6, 7, 8 and repress their expression, functioning as a
repressive transcription factor (TF). Meanwhile, SMXL6 can also form a
complex with unknown transcription factors that are expected to recognize
and bind to the promoters of BRC1, TCP1 or PAP1, repressing their transcription
as well. In the presence of SLs, D14 binds SLs (white and red circle within D14)
and promotes the formation of the D14-SCF^MAX2-SMXL6 complex, triggering
the ubiquitin-mediated degradation of SMXL6. This relieves the
transcriptional repression of SMXL6, 7, 8. Newly synthesized SMXL6 proteins in
turn repress transcription, forming a negative feedback loop. The degradation
of SMXL6 also releases its transcriptional repression of BRC1, TCP1 and PAP1,
thus activating signalling cascades that repress shoot branching, promote leaf
elongation and enhance anthocyanin biosynthesis, respectively.

In rice, IDEAL PLANT ARCHITECTURE 1 (IPA1), a key transcription
factor that controls plant architecture, can interact physically with
D53 and function in the feedback regulation of strigolactone-induced
D53 expression. The feedback regulation involving SMXL6 is different,
however. Furthermore, SQUAMOSA PROMOTER BINDING PROTEIN-LIKE9
(SPL9) and SPL15, the orthologues of IPA1 in Arabidopsis, function in
parallel pathways with SMXL6, 7, 8 in the regulation of shoot branching
(Extended Data Fig. 10), indicating diverse signalling mechanisms in
monocotyledonous and dicotyledonous plants. Our finding that
SMXL6 can bind DNA and functions as a transcription factor
distinguishes strigolactone signalling from other plant hormone signalling
pathways, including the auxin, jasmonate and gibberellin pathways,
that feature degradation mediated by the SCF ubiquitin ligase, thus
revealing noncanonical mechanisms across phytohormone
signalling. Re-evaluation of the direct targets of essential transcriptional
repressors in hormone signalling promises to reveal hitherto hidden
mechanisms underlying plant development, evolution and
environmental adaptation.
Methods


Plant materials and growth conditions

The smxl6, 7, 8 triple mutant, max3-9 mutant, d14-1 mutant, pap1-D
mutant and 35S:SMXL6-GFP transgenic line have been described
previously. T-DNA alleles of brc1-6 (GK-471F06.17) and the spl9-4 spl15-1
double mutant (CS67865) were distributed by the Nottingham
Arabidopsis Stock Centre (NASC). All lines are in the Col-0 background.
Arabidopsis seeds were surface sterilized, vernalized at 4 °C for 2-4 days,
and germinated on half-strength Murashige and Skoog (MS) medium
containing 1.0% (w/v) sucrose and 0.7% (w/v) agar. For morphological
observations, 10-day-old seedlings were transferred to pots containing
a 1:1 vermiculite:soil mixture saturated with one-third-strength MS
medium, and grown under a long-day photoperiod (16-h light, 8-h dark
cycle) with a light intensity of 60-80 μE m^-2 s^-1 at 21 °C. For preparation
of protoplasts, seedlings were grown under a short-day photoperiod
(8-h light, 16-h dark cycle) for 4 weeks as described.


Chemicals 

Synthetic rac-GR24 was obtained as a racemic mixture of equal amounts
of GR24^5DS and GR24^ent-5DS from Chiralix. The synthetic strigolactone
analogues GR24^4DO, GR24^ent-4DO and GR24^5DS were synthesized as described
in the Supplementary Information.


Chemical treatment and RNA-seq analysis 

Whole seedlings grown for 10 days horizontally on half-strength MS
plates were carefully transferred to half-strength MS liquid medium
without injury. Seedlings were cultured in half-strength MS liquid
medium for 2 h, then treated with 5 μM GR24^4DO or acetone (solvent)
in the greenhouse for 2 h or 4 h. Approximately 100 mg of whole
seedlings was harvested at each time point and frozen in liquid nitrogen
as independent biological samples. Total RNAs were isolated using a
TRIzol kit (Invitrogen). Illumina sequencing libraries were constructed
and sequenced using the Illumina HiSeq 2000 system with 125-bp read
lengths. The paired-end clean reads of RNA-seq were aligned to the
Arabidopsis reference genome TAIR10 using STAR (version 2.4.2a).
Fragment quantifications were computed with FeatureCounts (version
1.5.0) in paired-end mode; the features are exons. Default parameters
were used for both STAR and FeatureCounts. Differential-expression
analyses were conducted using the R (version 3.3.1) package
DESeq (version 1.26.0) with three biological replicates, and genes
with fold changes of more than 1.5 and P values of less than 0.05 were
selected for further analysis. To limit interference by the circadian
clock and environmental factors, we compared GR24^4DO treatment at
2 h to solvent control at 2 h and to samples at the start point of
treatment; genes upregulated in both comparisons were considered to be
upregulated strigolactone-responsive genes at 2 h. The same comparison
criteria were applied to identify downregulated genes at 2 h and
responsive genes at 4 h.
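
The two-comparison selection rule described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' pipeline: the gene identifiers and values are made up, the input is assumed to be a table of fold changes and P values already produced by the differential-expression step, and 'downregulated by more than 1.5-fold' is interpreted here as a fold change below 1/1.5.

```python
def passes_cutoffs(fold_change, p_value, direction, fc_cutoff=1.5, p_cutoff=0.05):
    """Apply the fold-change and P-value cutoffs to one gene in one comparison."""
    if p_value >= p_cutoff:
        return False
    if direction == "up":
        return fold_change > fc_cutoff
    if direction == "down":
        # Assumption: a >1.5-fold decrease corresponds to a ratio below 1/1.5.
        return fold_change < 1.0 / fc_cutoff
    raise ValueError("direction must be 'up' or 'down'")

def responsive_genes(vs_solvent, vs_start, direction):
    """Genes that pass the cutoffs in BOTH comparisons (treatment versus solvent
    at the same time point, and treatment versus the start of treatment).
    Each argument maps a gene ID to a (fold_change, p_value) pair."""
    shared = vs_solvent.keys() & vs_start.keys()
    return {
        gene
        for gene in shared
        if passes_cutoffs(*vs_solvent[gene], direction)
        and passes_cutoffs(*vs_start[gene], direction)
    }

# Hypothetical toy values for four placeholder genes.
vs_solvent = {"geneA": (2.0, 0.01), "geneB": (2.0, 0.01),
              "geneC": (0.5, 0.001), "geneD": (1.2, 0.001)}
vs_start = {"geneA": (1.8, 0.02), "geneB": (1.1, 0.5),
            "geneC": (0.4, 0.01), "geneD": (3.0, 0.001)}

up_genes = responsive_genes(vs_solvent, vs_start, "up")
down_genes = responsive_genes(vs_solvent, vs_start, "down")
print(sorted(up_genes), sorted(down_genes))
```

Requiring agreement between both comparisons discards genes such as the placeholder geneB, which changes relative to the solvent control but not relative to the start of treatment.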


Gene ontology analysis 

The enriched functions of DEGs in RNA-seq data sets were annotated
with the gene ontology (GO) function and Kyoto Encyclopedia of
Genes and Genomes (KEGG) pathways, which were determined using
the PlantGSEA program.


Gene-expression analysis 

Ten-day-old seedlings were cultured in half-strength MS liquid medium
for 2 h, then treated with 5 μM GR24^4DO, GR24^5DS, rac-GR24 or acetone
(solvent) in the greenhouse for 2 h or 4 h. Approximately 100 mg of whole
seedlings was harvested at each time point and frozen in liquid nitrogen as
independent biological samples. Total RNAs from various materials were
extracted using a TRIzol kit (Invitrogen), then treated with TURBO DNase
(Ambion). The first-strand complementary DNA was synthesized using
oligo (dT) primers with the SSIII first-strand synthesis system (Invitrogen).
Real-time PCR experiments were performed using gene-specific primers
(Supplementary Table 8) on a CFX 96 real-time PCR detection system
(BioRad). Arabidopsis ACTIN2 (ACT2) was used as the internal control.
Three or four biological repeats were performed. To compare
expression profiles upon chemical treatment, relative expression levels under
GR24^4DO, GR24^5DS or rac-GR24 treatment were normalized to expression
values of solvent (mock) treatments at specific time points. To compare
gene-expression levels in wild-type, mutant and transgenic plants, we
normalized the relative expression levels in mutant and transgenic lines
to the expression values of the wild type. Gene-expression analyses in buds
and 3-week-old plants are described in the figure legends. Experiments
were repeated independently three times.
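
As a rough illustration of the normalization just described (target gene relative to ACT2, then relative to the mock or wild-type sample), the widely used 2^-ΔΔCt calculation is sketched below. The text does not state which quantification formula was applied, so this method, the assumption of 100% amplification efficiency and the Ct values are all illustrative assumptions.

```python
def relative_expression(ct_target, ct_reference, ct_target_control, ct_reference_control):
    """2^-ddCt: expression of a target gene in a sample relative to a control
    sample, each first normalized to an internal reference gene (here ACT2).
    Assumes perfect doubling per cycle; this is an illustrative convention and
    is not taken from the paper."""
    delta_sample = ct_target - ct_reference
    delta_control = ct_target_control - ct_reference_control
    return 2.0 ** -(delta_sample - delta_control)

# Hypothetical Ct values: treated sample versus mock, both normalized to ACT2.
fold = relative_expression(ct_target=22.0, ct_reference=18.0,
                           ct_target_control=24.0, ct_reference_control=18.0)
print(fold)
```

With these made-up values the target crosses threshold two cycles earlier in the treated sample than in the mock sample, so the relative expression is fourfold.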


Protein-degradation analysis 

35S:SMXL6-GFP transgenic plants were grown for 10 days, collected, and
treated with 2 μM rac-GR24, GR24^5DS, GR24^4DO or GR24^ent-4DO for the
indicated times in half-strength MS liquid medium at 21 °C. Equal weights of
plant materials were collected for protein extraction using lysis buffer
(50 mM Tris-HCl at pH 7.5, 150 mM NaCl, 10% glycerol, 1% Nonidet P-40)
containing 1x complete protease-inhibitor cocktail (Roche). Protein
levels of SMXL6-GFP were detected by immunoblotting with mouse
anti-GFP monoclonal antibody (Roche, 11814460001). Experiments
were repeated independently three times. For gel source data, see
Supplementary Fig. 1.


ChIP analysis 

ChIP analyses were performed as described with minor modifications.
The SMXL6:SMXL6-HA 1 s678 transgenic line and s678 mutant were
grown for 2 weeks under the long-day condition on a half-strength
MS medium with 1% (w/v) sucrose. Approximately 6 g of seedlings was
ground into powder in liquid nitrogen and fixed in 1% (v/v)
formaldehyde for 30 min at 4 °C to crosslink protein-DNA complexes. The
nuclear fraction was isolated and resuspended with ChIP lysis buffer
containing 50 mM Tris-HCl, 10 mM EDTA and 1% (v/v) SDS at pH 8.0,
then placed on ice for 30 min. Two volumes of ChIP dilution buffer
containing 16.7 mM Tris-HCl, 167 mM NaCl and 1.1% (v/v) Triton X-100 at
pH 8.0 were added to the samples before sonication for 14 min (30 s on,
30 s off, high level) in a Bioruptor (Diagenode, UCD-200) to yield DNA
fragments of 300-500 base pairs in length. The lysates were diluted
using seven volumes of ChIP dilution buffer and centrifuged at 16,000g
for 15 min at 4 °C. Approximately 1.25% (v/v) of the sample was kept as
input, and the remainder of the supernatant was mixed with anti-HA
polyclonal antibodies (Sigma, H6908) coupled to Dynabeads protein
G (Life Technologies, 10003D), and incubated overnight at 4 °C. The
protein-DNA complexes were washed and eluted, and then underwent
reverse crosslinking. The DNA was precipitated in ethanol, recovered,
dissolved in water and stored at −80 °C.


Analysis of ChIP-seq data 

ChIP-seq libraries were prepared using the Nextflex Rapid DNA-seq
Kit (Bioo Scientific, 5144-02) and sequenced using the Illumina
NovaSeq 6000 system with a 150-bp read length. All ChIP-seq reads were
mapped to the Arabidopsis genome TAIR10 using BWA (version
0.7.10-1789) software with default parameters. Duplicated reads and reads
with low mapping quality were discarded using SAMtools (version
0.1.19-44428cd) with default parameters. Peaks in the SMXL6:SMXL6-HA 1
s678 transgenic line were identified by comparison with the s678 mutant
using the model-based analysis software MACS (version 1.4.2), with
parameters '-mfold = 10, 30', then determined by P values of less than
1x 10° and fold changes of more than 2.0. All detected peaks are listed 
in Supplementary Table 4 and were used for further analysis. 
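
The Results report peaks shared between the two ChIP-seq replicates, but the overlap criterion is not specified in this section. The sketch below shows one simple way such shared peaks could be counted, assuming that two peaks are 'shared' when they lie on the same chromosome and overlap by at least one base. The coordinates are invented, and the quadratic scan is meant for clarity rather than for genome-scale peak lists.

```python
def overlaps(peak_a, peak_b):
    """True if two (chromosome, start, end) half-open intervals share a base."""
    chrom_a, start_a, end_a = peak_a
    chrom_b, start_b, end_b = peak_b
    return chrom_a == chrom_b and start_a < end_b and start_b < end_a

def shared_peaks(replicate_1, replicate_2):
    """Peaks from replicate 1 that overlap at least one peak in replicate 2.
    The one-base overlap rule is an assumption, not the paper's stated criterion."""
    return [peak for peak in replicate_1
            if any(overlaps(peak, other) for other in replicate_2)]

# Hypothetical peak coordinates for two replicates.
rep1 = [("Chr1", 100, 200), ("Chr1", 500, 600), ("Chr2", 100, 200)]
rep2 = [("Chr1", 150, 250), ("Chr2", 300, 400)]

common = shared_peaks(rep1, rep2)
print(common)
```

For real peak lists, sorting by chromosome and start position and sweeping once through both lists would avoid the pairwise comparison.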


ChIP-qPCR 
The DNA prepared by ChIP was analysed using SsoFast EvaGreen supermix 
(BioRad) on a BioRad CFX96 real-time PCR detection system. 
Chromatin of s678 seedlings was used as a control. The enrichment of 
target gene promoters is displayed as a percentage of the input DNA. 
The TUA2 promoter was used as a nonspecific target. All primers used 
in qPCR analyses are listed in Supplementary Table 8. Experiments were 
repeated independently three times. 
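
Enrichment expressed as a percentage of input is commonly computed from qPCR threshold cycles (Ct) by first adjusting the input Ct for the fraction of chromatin that was set aside. The sketch below follows that common convention; the exact calculation used here is not stated in the text, so the formula, the function name and the assumed amplification efficiency of 2 are assumptions. The 1.25% input fraction is taken from the ChIP protocol above.

```python
import math

INPUT_FRACTION = 0.0125  # 1.25% (v/v) of the sample kept as input (from the ChIP protocol)

def percent_input(ct_ip, ct_input, input_fraction=INPUT_FRACTION, efficiency=2.0):
    """Percent-of-input enrichment from Ct values (common convention, assumed here).

    The input Ct is shifted to the value expected for 100% of the chromatin,
    then the IP signal is expressed relative to that adjusted input.
    """
    # Dilution factor of the input (80 for a 1.25% input), expressed in cycles.
    adjusted_input_ct = ct_input - math.log(1.0 / input_fraction, efficiency)
    return 100.0 * efficiency ** (adjusted_input_ct - ct_ip)

# If the IP and the 1.25% input give the same Ct, the IP recovered 1.25% of input.
print(round(percent_input(ct_ip=25.0, ct_input=25.0), 4))
# One cycle earlier in the IP doubles the recovery.
print(round(percent_input(ct_ip=24.0, ct_input=25.0), 4))
```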


Recombinant protein purification and EMSAs 

The full-length coding sequences of SMXL6, 7, 8 were amplified and 
cloned into the expression vector pGEX 6p-1 (GE Healthcare). The GST, 
GST-SMXL6, GST-SMXL7 and GST-SMXL8 proteins were induced using 
0.3 mM isopropyl β-D-1-thiogalactopyranoside at 16 °C for 20 h in BL21 
transetta cells (Transgene). Fusion proteins were purified using 
glutathione sepharose 4 fast flow (GE Healthcare) and quantified by 
SDS-PAGE. Biotin 5′-end-labelled DNA probes were synthesized or amplified 
using biotin 5′-end-labelled primers. Binding reaction samples containing 
0.2 ng or 10 fmol of biotin-labelled double-stranded DNAs, 0.5 µg of 
the recombinant GST, GST-SMXL6, GST-SMXL7 or GST-SMXL8 protein 
and 1 µg poly(dI-dC) were incubated at room temperature for 40 min 
and subjected to electrophoresis on 4% (w/v) polyacrylamide gels with 
half-strength TBE buffer. Biotin-labelled DNA was detected using the 
Lightshift chemiluminescent EMSA kit (Thermo Scientific). The supershift 
band was detected in the presence of 3.7 µg anti-GST monoclonal 
antibody (Sigma, G1160). All primers and probes are listed in 
Supplementary Table 8. Plasmid sequences have been deposited in GenBank 
(https://www.ncbi.nlm.nih.gov/genbank/) under the accession numbers listed 
in Supplementary Table 9. Experiments were repeated independently 
three or four times. For gel source data, see Supplementary Fig. 1. 


Leaf morphology analysis 

The fifth leaves of 3-week-old plants were harvested, laid flat on the sur- 
face of an agar medium plate, and photographed for further analysis. 
The leaf length (distance between the leaf tip and the base of petiole) 
and leaf width (the greatest distance across the leaf lamina 
perpendicular to the proximal/distal axis of the leaf) were measured manually 
using ImageJ (1.41o) software (http://rsbweb.nih.gov/ij/) as described. 


Anthocyanin measurement 

Three-week-old Arabidopsis seedlings were collected, ground into 
powder in liquid nitrogen, quickly weighed and boiled for 3 min in 
extraction buffer (propanol:HCl:H2O = 18:1:81). The anthocyanin 
content is presented as (A535 - A650) per gram of fresh weight. 
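
As a worked illustration of this definition, the sketch below computes anthocyanin content from the two absorbance readings and the sample fresh weight. The function name and the example readings are hypothetical; only the expression (A535 - A650) per gram of fresh weight comes from the text.

```python
def anthocyanin_content(a535, a650, fresh_weight_g):
    """Return (A535 - A650) per gram of fresh weight, as defined in the text."""
    if fresh_weight_g <= 0:
        raise ValueError("fresh weight must be positive")
    return (a535 - a650) / fresh_weight_g

# Made-up readings: A535 = 0.80, A650 = 0.05, 0.25 g of tissue.
example_content = anthocyanin_content(0.80, 0.05, 0.25)
print(round(example_content, 3))
```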


Quantification of ABA content 

For determination of the ABA content, unelongated buds in the axils of 
rosette leaves were carefully collected, frozen and ground in liquid nitrogen. 
Approximately 30 mg of frozen samples were weighed accurately, and 
extracted with 1 ml methanol containing 0.5 ng D,-ABA (internal standard) 
overnight at -20 °C. Samples were centrifuged for 15 min at 15,000g and 
4 °C, and the supernatant was collected, evaporated under nitrogen gas, 
and dissolved in ammonia solution (5% v/v). Then ABA was purified using 
Oasis MAX (Waters) solid-phase extraction cartridges. Endogenous ABA 
was analysed as described, with some modifications. Liquid 
chromatography with tandem mass spectrometry analysis was performed on 
an ultraperformance liquid chromatography (UPLC) system (Waters) 
coupled to a 6500 Q-Trap system (AB SCIEX). Liquid-chromatography 
separation used a BEH C18 column (internal diameter 2.1 mm × 100 mm, 
1.7 µm; Waters) with mobile phases A (0.05% v/v acetic acid in water) and 
B (0.05% v/v acetic acid in acetonitrile). The gradient was set with an initial 
20% of mobile phase B, increased to 70% B within 6 min. ABA was detected 
in multiple reaction monitoring (MRM) mode with transition 263.0/153.1. 
Quantitation was performed using the isotope dilution method. 
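
In its simplest form, isotope dilution quantitation scales the known amount of labelled internal standard by the ratio of the endogenous to the standard peak area. The sketch below shows that arithmetic under stated assumptions: a response factor of 1 between labelled and unlabelled ABA, and hypothetical peak areas. The 0.5 ng of internal standard and the approximate 30 mg sample mass come from the text; the function name and defaults are illustrative.

```python
def aba_ng_per_g(area_endogenous, area_standard,
                 standard_ng=0.5, fresh_weight_mg=30.0, response_factor=1.0):
    """Estimate ABA content (ng per g fresh weight) by isotope dilution.

    Assumes the labelled standard and endogenous ABA respond equally
    (response_factor = 1.0) unless a measured factor is supplied.
    """
    if area_standard <= 0 or fresh_weight_mg <= 0:
        raise ValueError("standard peak area and fresh weight must be positive")
    # Amount of endogenous ABA in the extract, in ng.
    aba_ng = (area_endogenous / area_standard) * standard_ng / response_factor
    # Convert the sample mass from mg to g before normalizing.
    return aba_ng / (fresh_weight_mg / 1000.0)

# Made-up peak areas: endogenous ABA gives twice the signal of the standard.
example_aba = aba_ng_per_g(area_endogenous=2000.0, area_standard=1000.0)
print(round(example_aba, 2))
```

In practice the accurately weighed mass of each sample would replace the 30 mg default.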


Drought stress 
Plants were grown on half-strength MS plates for a week, transferred 
to pots containing a 1:1 vermiculite:soil mixture saturated with 
one-third-strength MS medium, and grown in greenhouses (16-h 
light/8-h dark cycle, 60-80 µE m^-2 s^-1 light intensity, 21-22 °C) for a 
week before exposure to drought stress. Drought stress was carried 
out by withholding water until the lethal effect of dehydration was 
observed. The survival rate was calculated after rewatering for 3 days. 
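
The survival rate and its mean ± s.e.m. across pools can be computed as in the sketch below. The survivor counts are made up for illustration; the pool size of 24 plants follows the legend of Extended Data Fig. 3, and the helper names are hypothetical.

```python
import math
import statistics

def survival_rate(survivors, total):
    """Percentage of plants alive after rewatering."""
    return 100.0 * survivors / total

def mean_sem(values):
    """Mean and standard error of the mean (sample s.d. / sqrt(n))."""
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem

# Hypothetical counts of surviving plants in three pools of 24 plants each.
pools = [(6, 24), (12, 24), (9, 24)]
rates = [survival_rate(alive, total) for alive, total in pools]
mean_rate, sem_rate = mean_sem(rates)
print(rates, round(mean_rate, 2), round(sem_rate, 2))
```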


Transcriptional activity assay in protoplasts 

To construct the recombinant plasmids used in transcriptional activity 
assays, we amplified the promoters of SMXL6, SMXL7, SMXL8 and 
BRC1 and inserted them into the 35SLUC vector using an In-Fusion PCR 
cloning kit (Clontech), with primers listed in Supplementary Table 8. To 
detect regulation by SMXL6 and SMXL6ΔEAR on the promoters of SMXL6, 
SMXL7, SMXL8 and BRC1, we introduced combinations of effector plasmids 
(35S-GFP, 3xFlag-SMXL6 or 3xFlag-SMXL6-no-EAR), reporter plasmids 
(pSMXL6-LUC, pSMXL7-LUC, pSMXL8-LUC or pBRC1-LUC) and 
reference plasmids (pRTL) into Arabidopsis protoplasts as described. 
To examine the influence of GR24^4DO on SMXL6-mediated repression at 
the promoters of SMXL6, 7, 8 and BRC1, we introduced combinations 
of effector plasmids (35S-GFP or 3xFlag-SMXL6), reporter plasmids 
(pSMXL6-LUC, pSMXL7-LUC, pSMXL8-LUC or pBRC1-LUC) and reference 
plasmids (pRTL) into Arabidopsis protoplasts. After incubation 
in the presence or absence of 50 µM GR24^4DO at 21 °C for 16 h, 
luciferase activities were measured using a dual-luciferase reporter 
assay system (Promega). To detect SMXL6-mediated regulation of the 
SMXL7-WT and SMXL7-mu3 promoters, we introduced combinations of 
effector plasmids (35S-GFP or 3xFlag-SMXL6) and reporter plasmids 
(pSMXL7-WT-LUC or pSMXL7-mu3-LUC) into Arabidopsis protoplasts. 
After incubation at 21 °C for 12 h, luciferase activities were measured 
using the dual-luciferase reporter assay system (Promega). Plasmid 
sequences have been deposited in GenBank under the accession numbers 
listed in Supplementary Table 9. Experiments were repeated 
independently three times. 
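
The two normalization steps used for these reporter assays (firefly luciferase activity relative to Renilla luciferase, then relative to samples expressing GFP, as stated in the legend of Extended Data Fig. 8) can be sketched as follows. All readings are made up and the function names are hypothetical.

```python
def relative_luc(firefly, renilla):
    """Firefly LUC activity normalized against Renilla LUC (reference plasmid)."""
    if renilla <= 0:
        raise ValueError("Renilla reading must be positive")
    return firefly / renilla

def normalize_to_control(sample_ratios, control_ratios):
    """Express each sample ratio relative to the mean of the GFP control ratios."""
    control_mean = sum(control_ratios) / len(control_ratios)
    return [ratio / control_mean for ratio in sample_ratios]

# Hypothetical raw readings as (firefly, Renilla) pairs.
gfp_readings = [(1000.0, 100.0), (1200.0, 100.0), (800.0, 100.0)]
smxl6_readings = [(400.0, 100.0), (500.0, 100.0)]

gfp_ratios = [relative_luc(f, r) for f, r in gfp_readings]
smxl6_ratios = [relative_luc(f, r) for f, r in smxl6_readings]
relative_activity = normalize_to_control(smxl6_ratios, gfp_ratios)
print(relative_activity)
```

A relative activity below 1 in such a scheme would indicate repression of the reporter by the effector compared with the GFP control.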


Plasmid construction and plant transformation 

To construct the pSMXL6-SMXL6-HA and pSMXL6-SMXL6-no-EAR-HA 
plasmids, we amplified the 3xHA sequence, the nopaline synthase 
(NOS) terminator sequence, and the promoter and coding sequences 
of SMXL6 and SMXL6ΔEAR and sequentially cloned them into the 
pCAMBIA1300 vector. To construct p35S-TCP1-SRDX, we amplified the coding 
sequence of TCP1 and cloned it into the pWM101 vector. For CRISPR 
analysis of PAP1 and PAP2, the designed PAPs-targeting sequences 
were cloned into the AtU6-26-target-sgRNA vector, and then into the 
pYAO:hSpCas9 vector. For CRISPR analysis of TCP1, the designed 
TCP1-targeting sequences were cloned into the pHEE401 vector. All 
recombinant plasmids were introduced into Agrobacterium tumefaciens 
strain EHA105 and transformed into indicated Arabidopsis recipients 
as reported previously. The primers are listed in Supplementary 
Table 8. GenBank accession numbers of plasmid sequences are listed 
in Supplementary Table 9. 


Genetic analysis 

The double, triple and quadruple mutants were generated by crossing 
relevant homozygous single or double mutants and genotyping the F, or 
F, progenies. The genotypes of smxl6, smxl7, smxl8 and max3-9 mutants 
were detected as described. The brc1-6 mutant was genotyped using 
primer pairs of brc1-6 RP plus GABI-Kat and brc1-6 LP plus brc1-6 RP. The 
spl9-4 mutant was genotyped using primer pairs of spl9-4 RP plus LB1 
and spl9-4 LP plus spl9-4 RP. The spl15-1 mutant was genotyped using 
primer pairs of spl15-1 RP plus LBb1 and spl15-1 LP plus spl15-1 RP. The 
primers are listed in Supplementary Table 8. 


Accession numbers 

Sequence data from this article can be found in GenBank/EMBL 
(https://www.ncbi.nlm.nih.gov/genbank/) under the following accession 
numbers: ABCG37 (AT3G53480), ACT2 (AT3G18780), AFL1 (AT3G28270), 
ANS (AT4G22880), BEE3 (AT1G73830), BGLU47 (AT4G21760), 
bHLH29 (AT2G28160), BRC1 (AT3G18550), DFR (AT5G42800), 
DLK2 (AT3G24420), DVL15 (AT3G46613), ELIP1 (AT3G22840), 
EXP11 (AT1G20190), EXP12 (AT3G15370), HB40 (AT4G36740), 
HB52 (AT5G53980), HEC1 (AT5G67060), HSFA6B (AT3G22830), 
IAA19 (AT3G15540), IAA29 (AT4G32280), KAN1 (AT5G16560), 
KUF1 (AT1G31350), MAX3 (AT2G44990), MAX4 (AT4G32810), 
MYB113 (AT1G66370), MYB114 (AT1G66380), MYB27 (AT3G53200), 
NRT2.6 (AT3G45060), PAP1 (AT1G56650), PAP2 (AT1G66390), PAR1 
(AT2G42870), REM36 (AT4G31620), RR16 (AT2G40670), SAUR3 
(AT4G34790), SAUR22 (AT5G18050), SAUR29 (AT3G03820), 
SAUR61 (AT1G29420), SAUR65 (AT1G29460), SMXL6 (AT1G07200), 
SMXL7 (AT2G29970), SMXL8 (AT2G40130), SPL9 (AT2G42200), 
SPL15 (AT3G57920), STH7 (AT4G39070), TCP1 (AT1G67260), TGA7 
(AT1G77920), TT7 (AT5G07990), TUA2 (AT1G50010), WRKY38 
(AT5G22570), WRKY49 (AT5G43290), YUC3 (AT1G04610), AT1G64380, 
AT1G71520 and AT5G56840. 


Statistical analysis 

For gene-expression analyses, observations of phenotype, measure- 
ment of ABA and anthocyanins, ChIP-qPCR and assays of transcrip- 
tional activity, statistical analysis was assessed as described in the figure 
legends. P values were calculated by two-sided Student’s t-tests using 
Excel 2016, or by Tukey’s HSD test using R (version 3.6.1), and are shown 
in bar graphs. Statistical analyses of RNA-seq and ChIP-seq data are 
described in the Methods sections ‘Chemical treatment and RNA-seq 
analysis’ and ‘Analysis of ChIP-seq data’. We used Fisher’s exact test 
to calculate the enrichment of GR24^4DO-responsive genes amongst 
SMXL6-HA-targeted genes, and the χ² test to calculate the enrichment 
of ATAACAA in flanking sequences around the ChIP-seq peak summits 
of GR24^4DO-responsive genes. No statistical methods were used 
to predetermine sample size. The experiments were not randomized. 
The investigators were not blinded to allocation during experiments 
and outcome assessment. 
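
An enrichment test of the kind described for the overlap between two gene sets can be sketched with the hypergeometric tail that underlies Fisher's exact test. This is a minimal illustration, assuming a one-sided (enrichment) alternative, which the text does not specify; the function name and the gene counts are hypothetical, and no values from the study are used.

```python
from math import comb

def fisher_enrichment_p(overlap, set_a, set_b, universe):
    """One-sided Fisher's exact P value for over-representation.

    Probability of observing at least `overlap` shared genes when `set_b`
    genes are drawn at random from `universe` genes, of which `set_a`
    belong to the category of interest (hypergeometric upper tail).
    """
    upper = min(set_a, set_b)
    tail = sum(comb(set_a, k) * comb(universe - set_a, set_b - k)
               for k in range(overlap, upper + 1))
    return tail / comb(universe, set_b)

# Toy example: 10 genes in total, 5 responsive, 5 targeted, all 5 shared.
p_toy = fisher_enrichment_p(overlap=5, set_a=5, set_b=5, universe=10)
print(p_toy)
```

The function uses only the Python standard library (`math.comb`, available from Python 3.8); a statistics package would normally be used for genome-scale counts and for two-sided tests.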


Reporting summary 
Further information on research design is available in the Nature 
Research Reporting Summary linked to this paper. 


Data availability 


Uncropped gels and blots are provided in Supplementary Fig. 1. The 
RNA-seq and ChIP-seq data have been deposited in the Gene Expression 
Omnibus (www.ncbi.nlm.nih.gov/geo/) under the accession numbers 
GSE126331 and GSE140705. Materials and reagents are available from 
the corresponding authors on request. Source data are provided with 
this paper. 

Extended Data Fig. 1 | Comparison of rac-GR24 and GR24 stereoisomers in 
stimulating strigolactone signalling. a, Chemical structures of GR24 
stereoisomers with different stereochemical features. b, Stability of 
SMXL6-GFP in 35S:SMXL6-GFP transgenic seedlings treated with 2 µM rac-GR24, 
GR24^5DS, GR24^4DO or GR24^ent-5DS. Proteins were detected by immunoblotting 
with an anti-GFP monoclonal antibody. Data represent three independent 
experiments. c, Expression of SMXL6 and SMXL8 in 10-day-old wild-type 
seedlings pretreated in half-strength MS liquid medium for 2 h, and then 
treated with 5 µM rac-GR24, GR24^5DS or GR24^4DO for 0 h, 2 h or 4 h. Data were 
normalized to mock treatment at specific time points and are means ± s.e.m.; 
n = 3 biologically independent samples. d, Induction of SMXL6 and SMXL8 upon 
GR24^4DO treatment is blocked by D14 mutation. Data were normalized to mock 
treatment in wild-type seedlings at specific time points and are means ± s.e.m.; 
n = 3 biologically independent samples. c, d, P values are shown; two-sided 
Student’s t-test. 

Extended Data Fig. 2 | Verification of strigolactone-responsive genes by 
qRT-PCR analyses. a-e, g-i, Expression of strigolactone-responsive genes in 
10-day-old wild-type (Col-0) seedlings pretreated in half-strength MS liquid 
medium for 2 h, and then treated with 5 µM GR24^4DO in half-strength MS liquid 
medium for 2 h or 4 h. The qRT-PCR data in a-e, g-i are means ± s.e.m.; n = 3 
biologically independent samples (for SAUR22, SAUR29, SAUR61, AFL1, ELIP1, 
DLK2, TT7, ANS, HB52, WRKY38, WRKY49, MYB27, MAX4) or 4 biologically 
independent samples (for SAUR3, SAUR65, IAA19, IAA29, HB40, YUC3, KUF1, 
STH7/BZS1, AT5G56840, REM36, bHLH29, HSFA6B, AT1G64380, AT1G71520, HEC1, 
NRT2.6, BGLU47, DVL15, RR16, ABCG37, BEE3, EXP11, EXP12). P values are shown; 
two-sided Student’s t-test. f, Fold change in transcription-factor-encoding 
genes in response to GR24^4DO treatment based on RNA-seq data. ABCG37, 
ATP-BINDING CASSETTE G37; AFL1, At14a-LIKE 1; ANS, ANTHOCYANIDIN SYNTHASE; 
BEE3, BR-ENHANCED EXPRESSION 3; BGLU47, BETA-GLUCOSIDASE 47; bHLH29, 
BASIC HELIX-LOOP-HELIX PROTEIN 29; BRC1, BRANCHED 1; DLK2, D14-LIKE 2; 
DVL15, DEVIL 15; ELIP1, EARLY LIGHT-INDUCIBLE PROTEIN 1; EXP11, EXP12, 
EXPANSIN 11, 12; HB40, HB52, HOMEOBOX PROTEIN 40, 52; HEC1, HECATE 1; 
HSFA6B, HEAT SHOCK TRANSCRIPTION FACTOR A6B; IAA19, IAA29, 
INDOLE-3-ACETIC ACID INDUCIBLE 19, 29; KAN1, KANADI 1; KUF1, KAR-UP F-BOX 1; 
MAX4, MORE AXILLARY GROWTH 4; MYB27, MYB DOMAIN PROTEIN 27; NRT2.6, NITRATE 
TRANSPORTER 2.6; PAP1, PRODUCTION OF ANTHOCYANIN PIGMENT 1; PAR1, 
PHY RAPIDLY REGULATED 1; REM36, REPRODUCTIVE MERISTEM 36; RR16, 
RESPONSE REGULATOR 16; SAUR3, SAUR22, SAUR29, SAUR61, SAUR65, SMALL 
AUXIN UPREGULATED RNA 3, 22, 29, 61, 65; STH7/BZS1, SALT TOLERANCE 
HOMOLOG 7/BZR1-1D SUPPRESSOR 1; TCP1, TCP DOMAIN PROTEIN 1; TGA7, 
TGACG SEQUENCE-SPECIFIC BINDING PROTEIN 7; TT7, TRANSPARENT TESTA 7; 
WRKY38, WRKY49, WRKY DNA-BINDING PROTEIN 38, 49; YUC3, YUCCA 3. 




Extended Data Fig. 3 | SMXL6, 7, 8 negatively regulate AFL1 expression and 
drought tolerance. a, Induction of AFL1 upon GR24^4DO treatment is blocked in 
the d14 mutant. Data were normalized to mock treatment in wild-type 
seedlings at specific time points and are means ± s.e.m.; n = 3 biologically 
independent samples. b, Expression of AFL1 in 10-day-old seedlings of the wild 
type (Col-0) and s678 triple mutant. Data were normalized to the wild type and 
are means ± s.e.m.; n = 3 biologically independent samples. c, Phenotype and 
survival rate of wild-type and s678 plants. We exposed 2-week-old plants to 
drought stress by withholding water for 2 weeks and then rewatered for 
3 days. Data are means ± s.e.m.; n = 3 pools (24 plants each pool). a-c, P values 
are shown; two-sided Student’s t-test. 

Extended Data Fig. 4 | The EAR motif of SMXL6 is essential for regulation of 
shoot and leaf development. a, Phenotypes of wild-type (Col-0), s678, 
S6:S6-HA s678 and S6:S6ΔEAR-HA s678 plants. S6:S6-HA 1 s678 and S6:S6-HA 2 s678 
represent independent transgenic lines with the S6:S6-HA transgene in the s678 
background; S6:S6ΔEAR-HA 1 s678 and S6:S6ΔEAR-HA 2 s678 represent independent 
transgenic lines containing the S6:S6ΔEAR-HA transgene in the s678 background. 
Scale bar, 5 cm. b, Quantitative analysis of shoot branching in the adult plants 
shown in a. We counted the number of primary branches grown from the 
rosette leaf axil (left) and secondary branches grown from the cauline leaf axil 
(right) of at least 0.5 cm. Data are means ± s.e.m.; n = 20 plants. c, Ratio of leaf 
length to width for the fifth leaves of the wild-type (Col-0), s678, S6:S6-HA s678 
and S6:S6ΔEAR-HA s678 plants after growth for 3 weeks. Data are means ± s.e.m.; 
n = 15 leaves. d, Leaf morphology of 3-week-old plants. The fifth leaves are 
marked by white arrows. Scale bar, 1 cm. b, c, P values are shown; Tukey’s 
HSD test. 

Extended Data Fig. 5 | BRC1 regulates strigolactone-mediated shoot 
branching. a, Expression of BRC1 in buds of primary rosette (RI) and secondary 
cauline (CII) branches. Data were normalized to the wild type (Col-0) and are 
means ± s.e.m.; n = 3 biologically independent samples. b, Schematic 
representation of the T-DNA insertion mutant brc1-6. The T-DNA insertion site 
and positions of RT-qPCR primers are indicated by white and black triangles, 
respectively. c, Expression of BRC1 in 10-day-old seedlings using B1-1 and B1-2 
primer pairs. Data were normalized to the wild type and are means ± s.e.m.; n = 3 
biologically independent samples. d, Shoot-branching phenotypes of plants at 
the adult stage. Scale bar, 5 cm. Data represent 15 independent experiments. 
We counted the number of primary rosette branches and secondary cauline 
branches of at least 0.5 cm. Data are means ± s.e.m.; n = 15 plants. e, f, Rosette 
leaf number (e) and the ratio of primary rosette branch to rosette leaf number 
(f) in adult Col-0, d14-1, s678, brc1-6, and s678 brc1-6 plants. Data are 
means ± s.e.m.; n = 14 plants. g, Induction of HB40 upon GR24^4DO treatment is 
blocked in the d14 mutant. Data were normalized to mock treatment in 
wild-type seedlings at specific time points and are means ± s.e.m.; n = 3 
biologically independent samples. h, ABA content in unelongated axillary buds 
of rosette leaves in Col-0, max3-9 and s678 plants. Data are means ± s.e.m.; n = 4 
biologically independent samples. a, c-h, P values are shown; two-sided 
Student’s t-test. 

Extended Data Fig. 6 | TCP1 contributes to strigolactone-mediated 
leaf-shape regulation. a, Ratio of leaf length to width for the fifth leaves in 
wild-type (Col-0), s678, brc1-6 and s678 brc1-6 plants grown for 3 weeks. Data are 
means ± s.e.m.; n = 12 leaves. b, The EAR motif is required for SMXL6-mediated 
regulation of the expression of TCP1 in seedlings. Data were normalized to wild 
type and are means ± s.e.m.; n = 3 biologically independent samples. c, Left, 
morphology of the fifth leaves of 3-week-old plants. 35S:TCP1-SRDX 1 and 
35S:TCP1-SRDX 2 represent independent transgenic lines with the 
35S:TCP1-SRDX transgene in the Col-0 background; 35S:TCP1-SRDX 1 s678 and 
35S:TCP1-SRDX 2 s678 represent independent transgenic lines with the 
35S:TCP1-SRDX transgene in the s678 background. Scale bar, 1 cm. Right, ratio of 
leaf length to width for the fifth leaves. Data are means ± s.e.m.; n = 12 leaves. 
d, Morphology of the fifth leaves of wild-type, s678, tcp1-1 and s678 tcp1-1 plants 
grown for 3 weeks. Scale bar, 1 cm. e, Mutation site of the tcp1-1 mutant. 
a-c, P values are shown; Tukey’s HSD test (a, c) or two-sided Student’s t-test (b). 

Extended Data Fig. 7 | Strigolactones promote expression of 
anthocyanin-biosynthesis genes. a, Induction of PAP2, MYB113 and MYB114 
upon GR24^4DO treatment is blocked in the d14 mutant. Data were normalized to 
mock treatment in wild-type seedlings at specific time points and are 
means ± s.e.m.; n = 3 biologically independent samples. b, c, Anthocyanin 
content in Col-0 plants, strigolactone mutants and transgenic lines after 
growth for 3 weeks. Data are means ± s.e.m.; n = 3 pools (6 seedlings per pool). 
d, Expression of PAP1, PAP2, MYB113, MYB114 and DFR in 3-week-old wild-type 
[…] type and are means ± s.e.m.; n = 3 biologically independent samples. 
e, Expression of PAP1, PAP2, MYB113, MYB114 and DFR in 3-week-old wild-type, 
s678, S6:S6-HA s678 and S6:S6ΔEAR-HA s678 plants. Data were normalized to wild 
type and are means ± s.e.m.; n = 3 biologically independent samples. f, The 
pap1-D mutant rescued the anthocyanin-biosynthesis deficiency of the max3-9 
mutant. Scale bar, 1 mm. Data represent 18 independent experiments. 
g, Mutation sites in the pap1-2 and pap2-1 mutants. a-e, P values are shown; 
two-sided Student’s t-test (a, b, d, e) or Tukey’s HSD test (c). 

Extended Data Fig. 8 | Feedback inhibition of SMXL transcription in 
Arabidopsis. a, SMXL6 binds to the SMXL6 and SMXL8 promoters in ChIP-seq 
(left) and ChIP-qPCR assays (right). Red lines below the peak regions show the 
locations of probes used in ChIP-qPCR assays and EMSAs. Chromatin from 
S6:S6-HA s678 and s678 seedlings was immunoprecipitated with anti-HA 
polyclonal antibodies. In ChIP-qPCR assays, the enrichment of target gene 
promoters is displayed as a percentage of input DNA. Values are means ± s.e.m.; 
n = 4 biologically independent samples. The TUA2 promoter was used as a 
nonspecific target. b, Diagrams showing the constructs used in transient 
expression assays in protoplasts. We included the -1598 to +402 region of 
SMXL6, -1607 to +393 region of SMXL7, -1890 to +110 region of SMXL8 and 
-2520 to +480 region of BRC1 in pSMXL6-LUC, pSMXL7-LUC, pSMXL8-LUC and 
pBRC1-LUC constructs, respectively. The small black rectangle represents the 
EAR motif in SMXL6. Firefly LUC, firefly luciferase reporter gene; Renilla LUC, 
Renilla luciferase reporter gene. Firefly LUC activity was normalized against 
that of Renilla LUC. c, Transcriptional regulation by SMXL6 and SMXL6ΔEAR of 
SMXL6 and SMXL8 promoters in transient expression assays. Data were 
normalized to samples expressing GFP and are means ± s.e.m.; n = 3 biologically 
independent samples. d, e, Expression of SMXL6, SMXL7 and SMXL8 in 
nonelongated axillary buds of primary rosette (RI) branches and secondary 
cauline (CII) branches of Col-0 and strigolactone mutants. Data were 
normalized to wild type and are means ± s.e.m.; n = 3 biologically independent 
samples. a, c-e, P values are shown; two-sided Student’s t-test. 

Extended Data Fig. 9 | Domain mapping of the SMXL7 promoter bound by GST-SMXL6, GST-SMXL7 and GST-SMXL8 to the promoters of SMXL6, 7, 8 in 

Extended Data Fig. 10 | Relationship between SMXL6, 7, 8 and SPL9, 15 in 
shoot branching. a, b, Phenotypes and quantitative analysis of wild-type 
(Col-0), d14-1, s678, spl9-4 spl15-1 and s678 spl9-4 spl15-1 plants grown for 6 
weeks. Scale bar, 3 cm. Data are means ± s.e.m.; n = 14 plants. c, Phenotypes of 
Col-0, s678, spl9-4 spl15-1 and s678 spl9-4 spl15-1 plants grown for 3 weeks. Scale 

Statistics 

For all statistical analyses, confirm that the following items are present in the figure legend, table legend, main text, or Methods section. 

n/a | Confirmed 

[x] The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement 

[x] A statement on whether measurements were taken from distinct samples or whether the same sample was measured repeatedly 

The statistical test(s) used AND whether they are one- or two-sided 
Only common tests should be described solely by name; describe more complex techniques in the Methods section. 

[x] A description of all covariates tested 

[x] A description of any assumptions or corrections, such as tests of normality and adjustment for multiple comparisons 

[x] A full description of the statistical parameters including central tendency (e.g. means) or other basic estimates (e.g. regression coefficient) 
AND variation (e.g. standard deviation) or associated estimates of uncertainty (e.g. confidence intervals) 

[x] For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted 
Give P values as exact values whenever suitable. 

[x] For Bayesian analysis, information on the choice of priors and Markov chain Monte Carlo settings 
[x] For hierarchical and complex designs, identification of the appropriate level for tests and full reporting of outcomes 
[x] Estimates of effect sizes (e.g. Cohen's d, Pearson's r), indicating how they were calculated 


Our web collection on statistics for biologists contains articles on many of the points above. 


Software and code 


Policy information about availability of computer code 


Data collection: No software was used. 

Data analysis: STAR (version 2.4.2a), FeatureCounts (version 1.5.0), R (version 3.3.1), and DESeq (version 1.26.0) were used in RNA-seq analysis; BWA 
(version 0.7.10-r789), SAMtools (version 0.1.19-44428cd), and MACS (version 1.4.2) were used in ChIP-seq analysis; Excel 2016 was used 
in two-sided Student's t-test; R (version 3.6.1) was used in Tukey's HSD test; and ImageJ (1.41o) was used in measurement of leaf length 
and width. 


For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors/reviewers. 
We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Research guidelines for submitting code & software for further information. 


Data 


Policy information about availability of data 


All manuscripts must include a data availability statement. This statement should provide the following information, where applicable: 
- Accession codes, unique identifiers, or web links for publicly available datasets 
- A list of figures that have associated raw data 
- A description of any restrictions on data availability 


The RNA-Seq and ChIP-Seq data have been deposited in the GEO (www.ncbi.nlm.nih.gov/geo/) database under the accession numbers GSE126331 and GSE140705. 

Life sciences study design 

All studies must disclose on these points even when the disclosure is negative. 

Sample size: No statistical methods were used to predetermine sample size. Sample sizes were chosen to be as large as possible, following common 
practice in the field. We used 3 biological replicates for RNA-seq analysis and 2 biological replicates for ChIP-seq analysis, following 
common practice in the field. In RT-qPCR and ChIP-qPCR assays, 3 or 4 biological replicates were used, following common practice in the 
field. 

Data exclusions: No data were excluded. 

Replication: Each experiment was reproduced at least three times on separate occasions. Experimental findings were reliably reproduced. 

Randomization: All samples were allocated randomly into experimental groups. 

Blinding: The investigators were not blinded to allocation during experiments and outcome assessment. To obtain results that were as objective as possible, 
other researchers repeated multiple experiments. 


Reporting for specific materials, systems and methods 


We require information from authors about some types of materials, experimental systems and methods used in many studies. Here, indicate whether each material, 
system or method listed is relevant to your study. If you are not sure if a list item applies to your research, read the appropriate section before selecting a response. 


Materials & experimental systems 
n/a | Involved in the study 
[ ] | [x] Antibodies 
[x] | [ ] Eukaryotic cell lines 
[x] | [ ] Palaeontology 
[x] | [ ] Animals and other organisms 
[x] | [ ] Human research participants 
[x] | [ ] Clinical data 

Methods 
n/a | Involved in the study 
[ ] | [x] ChIP-seq 
[x] | [ ] Flow cytometry 
[x] | [ ] MRI-based neuroimaging 

Antibodies 

Antibodies used: Anti-GFP mouse (#11814460001, clones 13.1 and 7.1, 1:2000 dilution for western blotting, Roche, Lot 11751700) 
Anti-HA rabbit (#H6908, 1:25 dilution for ChIP, 120 µg per ChIP sample, Sigma, Lot 106M4792V) 
Anti-GST mouse (#G1160, clone GST-2, 1:40 dilution for EMSA, 3.7 µg per EMSA reaction, Sigma, Lot 035M4806V) 

Validation: All antibodies used in this study were certified and validated by manufacturers and vendors. 

Details of the anti-GFP monoclonal antibody are available at https://www.sigmaaldrich.com/catalog/product/roche/11814460001?lang=zh&region=CN 

Details of the anti-HA polyclonal antibody are available at https://www.sigmaaldrich.com/catalog/product/sigma/h6908?lang=zh&region=CN 

Details of the anti-GST monoclonal antibody are available at https://www.sigmaaldrich.com/catalog/product/sigma/g1160?lang=zh&region=CN 


ChIP-seq 


Data deposition 


[x] Confirm that both raw and final processed data have been deposited in a public database such as GEO. 

[x] Confirm that you have deposited or provided access to graph files (e.g. BED files) for the called peaks. 

Data access links (may remain private before publication): We have submitted raw data, the WIG and BED files to GEO. The reviewer access link is 
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE140705. 

Methodology 

We have uploaded the WIG and BED files to GEO. 

Replicates: We have done two biological replicates. 

Sequencing depth: The sequencing depths for s678 SMXL6HA_ChIP_rep1, s678_ChIP_rep1, s678 SMXL6HA_ChIP_rep2 and s678_ChIP_rep2 are 
49.44x, 34.24x, 44.32x and 48.88x; the total numbers of reads for s678 SMXL6HA_ChIP_rep1, s678_ChIP_rep1, s678 
SMXL6HA_ChIP_rep2 and s678_ChIP_rep2 are 82,765,244, 72,606,380, 87,725,990 and 89,203,402; the properly mapped 
reads for s678 SMXL6HA_ChIP_rep1, s678_ChIP_rep1, s678 SMXL6HA_ChIP_rep2 and s678_ChIP_rep2 are 72,511,020, 
62,392,227,422,476, and 67,640,961; the read lengths for s678 SMXL6HA_ChIP_rep1, s678_ChIP_rep1, s678 
SMXL6HA_ChIP_rep2 and s678_ChIP_rep2 are 150, 150, 150 and 150, and they are all paired-end. 

Antibodies: Anti-HA polyclonal antibody was bought from Sigma with catalog number H6908. 

Peak calling parameters: Peaks in the SMXL6:SMXL6-HA s678 transgenic line were identified by the model-based analysis software MACS (version 
1.4.2) in comparison with peaks in the s678 mutant with the parameters '--mfold=10,30', then filtered for P values of less 
than 1e-5 and fold changes of more than 2.0. 

Data quality: The Q30 percentages for s678 SMXL6HA_ChIP_rep1_R1.fastq.gz, s678 SMXL6HA_ChIP_rep1_R2.fastq.gz, 
s678_ChIP_rep1_R1.fastq.gz, s678_ChIP_rep1_R2.fastq.gz, s678 SMXL6HA_ChIP_rep2_R1.fastq.gz, s678 
SMXL6HA_ChIP_rep2_R2.fastq.gz, s678_ChIP_rep2_R1.fastq.gz and s678_ChIP_rep2_R2.fastq.gz are 94.82, 94.28, 91.98, 91.67, 
93.14, 90.94, 93.21 and 91.22, indicating good sequencing quality. The numbers of peaks with P values of less than 1e-5 
and enrichment above 2-fold are 1,665 and 1,401 in replicates 1 and 2 of SMXL6-HA ChIP-seq. 

Software: All ChIP-seq reads were mapped to the Arabidopsis genome TAIR10 using BWA (version 0.7.10-r789) software with default 
parameters. Duplicated reads and reads with low mapping quality were discarded using SAMtools (version 0.1.19-44428cd) 
with default parameters. Peaks for SMXL6-HA were identified by the model-based analysis software MACS (version 1.4.2) in 
comparison with s678 with the parameters '--mfold=10,30'. 

The ongoing outbreak of viral pneumonia in China and across the world is associated 
with a new coronavirus, SARS-CoV-2. This outbreak has been tentatively associated 
with a seafood market in Wuhan, China, where the sale of wild animals may be the source 
of zoonotic infection. Although bats are probable reservoir hosts for SARS-CoV-2, the 
identity of any intermediate host that may have facilitated transfer to humans is 
unknown. Here we report the identification of SARS-CoV-2-related coronaviruses in 
Malayan pangolins (Manis javanica) seized in anti-smuggling operations in southern 
China. Metagenomic sequencing identified pangolin-associated coronaviruses that 
belong to two sub-lineages of SARS-CoV-2-related coronaviruses, including one that 
exhibits strong similarity in the receptor-binding domain to SARS-CoV-2. The discovery 
of multiple lineages of pangolin coronavirus and their similarity to SARS-CoV-2 suggests 
that pangolins should be considered as possible hosts in the emergence of new 
coronaviruses and should be removed from wet markets to prevent zoonotic 
transmission. 


An outbreak of serious pneumonia disease was reported in Wuhan, 
China, on 30 December 2019. The causative agent was soon identified 
as a novel coronavirus, which was later named SARS-CoV-2. Case 
numbers grew rapidly from 27 in December 2019 to 3,090,445 globally 
as of 30 April 2020, leading to the declaration of a public health 
emergency, and later a pandemic, by the WHO (World Health Organization). 
Many of the early cases were linked to the Huanan seafood market 
in Wuhan city, Hubei province, from where the probable zoonotic 
source is speculated to originate. Currently, only environmental 
samples taken from the market have been reported to be positive for 
SARS-CoV-2 by the Chinese Center for Disease Control and Prevention. 
However, as similar wet markets were implicated in the SARS outbreak 
of 2002-2003, it seems likely that wild animals were also involved in 
the emergence of SARS-CoV-2. Indeed, a number of mammalian species 
were available for purchase in the Huanan seafood market before the 
outbreak. Unfortunately, because the market was cleared soon after 
the outbreak began, determining the source virus in the animal 
population from the market is challenging. Although a coronavirus that is 
closely related to SARS-CoV-2, which was sampled from a Rhinolophus 
affinis bat in Yunnan in 2013, has now been identified, similar viruses 
have not yet been detected in other wildlife species. Here we 
identified SARS-CoV-2-related viruses in pangolins smuggled into 
southern China. 

We investigated the virome composition of pangolins (mammalian 
order Pholidota). These animals are of growing importance and interest 
because they are one of the most illegally trafficked mammal species: 
they are used as a food source and their scales are used in traditional 
Chinese medicine. A number of pangolin species are now regarded 
as critically endangered on the International Union for Conservation 
of Nature Red List of Threatened Species. We received frozen tissue 
samples (lungs, intestine and blood) collected from 18 Malayan pangolins
(Manis javanica) during August 2017 to January 2018. These pangolins
were obtained during anti-smuggling operations performed by
Guangxi Customs officers. Notably, high-throughput sequencing of
the RNA of these samples revealed the presence of coronaviruses in 6
out of 43 samples (2 lung samples, 2 intestinal samples, 1 lung-intestine
mixed sample and 1 blood sample from 5 individual pangolins;
Extended Data Table 1). With the sequence read data, and by filling
gaps with amplicon sequencing, we were able to obtain six complete or
near-complete genome sequences (denoted GX/P1E, GX/P2V, GX/P3B,
GX/P4L, GX/P5E and GX/P5L) that fall into the SARS-CoV-2 lineage
(within the genus Betacoronavirus of the Coronaviridae) in a

Fig. 1 | Evolutionary relationships among sequences of human SARS-CoV-2,
pangolin coronaviruses and the other reference coronaviruses. a, Genome
organization of coronaviruses including the pangolin coronaviruses obtained
in this study, with the predicted ORFs shown in different colours (ORF1a is
omitted for clarity). The pangolin coronavirus strain GX/P2V is shown with its
sequence length. For comparison, the human sequences NC_045512.2 and
NC_004718.3, and bat sequences MG772933.1, GQ153541.1 and KC881006.1 are
included (see Extended Data Table 6 for sources). b, Phylogeny of the subgenus
Sarbecovirus (genus Betacoronavirus; n = 53) estimated from the concatenated
ORF1ab, S, E, M and N genes. Red circles indicate the pangolin coronavirus
sequences generated in this study (Extended Data Table 1). GD/P1L is the


phylogenetic analysis (Fig. 1b). The genome sequence of the virus 
isolate (GX/P2V) has a very high similarity (99.83-99.92%) to the five 
sequences that were obtained through the metagenomic sequencing of 
the raw samples, and all samples have similar genomic organizations to 
SARS-CoV-2, with eleven predicted open-reading frames (ORFs) (Fig. 1a 
and Extended Data Table 2; two ORFs overlap). We were also able to 
successfully isolate the virus using the Vero E6 cell line (Extended Data 
Fig. 1). On the basis of these genome sequences, we designed primers 
for quantitative PCR (qPCR) detection to confirm that the raw samples 
were positive for coronavirus. We conducted further qPCR testing on 
another batch of archived pangolin samples collected between May 
and July 2018. Among the 19 samples (9 intestine tissues, 10 lung tissues)
tested from 12 animals, 3 lung tissue samples from 3 individual
pangolins were positive for coronavirus. 

In addition to the animals from Guangxi, after the start of the 
SARS-CoV-2 outbreak, researchers at the Guangzhou Customs Technology
Center re-examined five archived pangolin samples (two
skin swabs, two unknown tissue samples and one scale) obtained 
in anti-smuggling operations performed in March 2019. Following 
high-throughput sequencing, the scale sample was found to contain 

consensus sequence re-assembled from previously published raw data.
Phylogenies were estimated using a maximum likelihood approach that used
the GTRGAMMA nucleotide substitution model and 1,000 bootstrap
replicates. Scientific names of the bat hosts are indicated at the end of the
sequence names, and abbreviated as follows: C. plicata, Chaerephon plicata;
R. affinis, Rhinolophus affinis; R. blasii, Rhinolophus blasii; R. ferrumequinum,
Rhinolophus ferrumequinum; R. monoceros, Rhinolophus monoceros;
R. macrotis, Rhinolophus macrotis; R. pearsoni, Rhinolophus pearsoni;
R. pusillus, Rhinolophus pusillus; R. sinicus, Rhinolophus sinicus. Palm civet
(P. larvata, Paguma larvata; species unspecified for Civet007 and PC4-13
sequences) and human (H. sapiens, Homo sapiens) sequences are also shown.


coronavirus reads, and from these data we assembled a partial genome
sequence of 21,505 bp (denoted as GD/P2S), representing approximately
72% of the SARS-CoV-2 genome. Notably, this virus sequence,
obtained from a pangolin scale sample, may in fact be derived from
contaminants of other infected tissues. Another study of diseased
pangolins in Guangdong performed in 2019 also identified viral
contigs from lung samples that were similarly related to SARS-CoV-2.
We applied different assembly methods and manual curation to the raw
data of that study to generate a partial genome sequence that comprised 86.3% of the
full-length virus genome (denoted as GD/P1L in the phylogeny shown
in Fig. 1b).

These pangolin coronavirus genomes have 85.5% to 92.4% 
sequence similarity to SARS-CoV-2, and represent two sub-lineages 
of SARS-CoV-2-related viruses in the phylogenetic tree, one of which 
(comprising GD/P1L and GD/P2S) is very closely related to SARS-CoV-2 
(Fig. 1b). It has previously been noted that members of the subgenus
Sarbecovirus have experienced widespread recombination. In support
of this, a recombination analysis (Fig. 2) revealed that bat coronaviruses
ZC45 and ZXC21 are probably recombinants, containing genome
fragments derived from multiple SARS-CoV-related lineages (genome

Fig. 2 | Recombination analysis. a, Sliding window
between human SARS-CoV-2, pangolin and bat
coronaviruses. The potential recombination
breakpoints are shown in pink dashed lines, and regions

regions 2, 5 and 7) as well as SARS-CoV-2-related lineages, including
segments from pangolin coronaviruses (regions 1, 3, 4, 6 and 8). 
More notable, however, was the observation of putative recombination
signals between the pangolin coronaviruses, bat coronavirus
RaTG13 and human SARS-CoV-2 (Fig. 2). In particular, SARS-CoV-2
exhibits very high sequence similarity to the Guangdong pangolin
coronaviruses in the receptor-binding domain (RBD) (97.4% amino acid
similarity, indicated by red arrow in Fig. 2a; the alignment is shown in
Fig. 3a), even though it is most closely related to bat coronavirus RaTG13
in the remainder of the viral genome. Indeed, the Guangdong pangolin
coronaviruses and SARS-CoV-2 possess identical amino acids at the five
critical residues of the RBD, whereas RaTG13 only shares one amino acid
with SARS-CoV-2 (residue 442, according to numbering of the human
SARS-CoV) and these latter two viruses have only 89.2% amino acid
similarity in the RBD. Notably, a phylogenetic analysis of synonymous
sites only from the RBD revealed that the topological position of the
Guangdong pangolin coronaviruses is consistent with that of the remainder of the viral
genome, rather than being the closest relative of SARS-CoV-2 (Fig. 3b).
Therefore, it is possible that the amino acid similarity between the RBD
of the Guangdong pangolin coronaviruses and SARS-CoV-2 is due to
selectively mediated convergent evolution rather than recombination,
although it is difficult to differentiate between these scenarios
on the basis of the current data. This observation is consistent with
the fact that the sequence similarity of ACE2 is higher between humans
and pangolins (84.8%) than between humans and bats (80.8-81.4% for
Rhinolophus sp.) (Extended Data Table 3). The occurrence of recombination
and/or convergent evolution further highlights the role that
intermediate animal hosts have in the emergence of viruses that can

Fig. 3 | Analysis of the RBD sequence. a, Sequence alignment showing the RBD
in human, pangolin and bat coronaviruses. The five critical residues for binding
between SARS-CoV RBD and human ACE2 protein are indicated in red boxes,
and ACE2-contacting residues are indicated by yellow boxes as previously
described. In the Guangdong pangolin-CoV sequence, the codon positions
encoding the amino acids Pro337, Asn420, Pro499 and Asn519 have ambiguous
nucleotide compositions, resulting in possible alternative amino acids at these


infect humans. However, all of the pangolin coronaviruses identified
to date lack the insertion of a polybasic (furin-like) S1/S2 cleavage site
in the spike protein that distinguishes human SARS-CoV-2 from related
betacoronaviruses (including RaTG13) and that may have helped
to facilitate the emergence and rapid spread of SARS-CoV-2 through
human populations.
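
The RBD comparison discussed above rests on two simple quantities: the percentage of identical residues between two aligned sequences, and whether the sequences agree at a handful of key positions. The following is a minimal sketch of both, assuming the sequences are already aligned to the same length. The function names, the toy sequences and the column indices are hypothetical illustrations and are not data or code from this study.

```python
def percent_identity(a, b, gap="-"):
    """Percent identity of two aligned sequences, skipping columns with a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = matches = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue
        compared += 1
        matches += x == y
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * matches / compared


def shared_residues(a, b, positions, gap="-"):
    """Return the 0-based alignment columns in `positions` where both sequences carry the same residue."""
    return [p for p in positions if a[p] == b[p] and a[p] != gap]


# Toy aligned fragments (hypothetical, for illustration only).
human = "NLFQYSTG"
pangolin = "NLFQYSTA"
bat = "NLYQD-TG"
key_columns = [2, 4, 7]  # hypothetical stand-ins for critical contact residues

print(percent_identity(human, pangolin))
print(shared_residues(human, pangolin, key_columns))
print(shared_residues(human, bat, key_columns))
```

A sequence can score higher on overall identity yet share fewer key residues, or the reverse, which is why the two measures are reported separately in the text.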

To our knowledge, pangolins are the only mammals in addition to 
bats that have been documented to be infected by a SARS-CoV-2-related 
coronavirus. It is notable that two related lineages of coronaviruses are 
found in pangolins that were independently sampled in different Chinese 
provinces and that both are also related to SARS-CoV-2. This suggests 
that these animals may be important hosts for these viruses, which is 
surprising as pangolins are solitary animals that have relatively small 
population sizes, reflecting their endangered status. Indeed, on the basis
of the current data it cannot be excluded that pangolins acquired their
SARS-CoV-2-related viruses independently from bats or another animal
host. Therefore, their role in the emergence of human SARS-CoV-2 remains
to be confirmed. In this context, it is noteworthy that both lineages of 
pangolin coronaviruses were obtained from trafficked Malayan pangolins, 
which originated from Southeast Asia, and that there is a marked lack of 
knowledge of the viral diversity maintained by this species in regions in 
which it is indigenous. Furthermore, the extent of virus transmission 
in pangolin populations should be investigated further. However, the 
repeated occurrence of infections with SARS-CoV-2-related coronaviruses 
in Guangxi and Guangdong provinces suggests that this animal may have 
an important role in the community ecology of coronaviruses.

Coronaviruses, including those related to SARS-CoV-2, are present
in many wild mammals in Asia. Although the epidemiology, pathogenicity,
interspecies infectivity and transmissibility of coronaviruses
in pangolins remain to be studied, the data presented here strongly
suggest that handling these animals requires considerable caution
and that their sale in wet markets should be strictly prohibited. Further
surveillance of pangolins in their natural environment in China and
Southeast Asia is necessary to understand their role in the emergence
of coronaviruses and the risk of future zoonotic transmissions.

Methods


Data reporting 

No statistical methods were used to predetermine sample size. The 
experiments were not randomized and the investigators were not 
blinded to allocation during experiments and outcome assessment. 


Ethics statement 

The animals studied here were rescued and treated by the Guangxi
Zhuang Autonomous Region Terrestrial Wildlife Medical-aid and Monitoring
Epidemic Diseases Research Center under the ethics approval
(wild animal treatment regulation No. [2011] 85). The samples were
collected following the procedure guideline (Pangolins Rescue Procedure,
November, 2016).


Sample collection, viral detection and sequencing of pangolins 
in Guangxi 
We received frozen tissue samples of 18 pangolins (M. javanica) from
Guangxi Medical University, China, that were collected between August
2017 and January 2018. These pangolins were seized by the Guangxi
Customs during their routine anti-smuggling operations. Samples from
multiple organs, including lungs, intestine and blood, were available for
each animal, with the exception of six individuals for which
only lung tissues were available, five with mixed intestine and lung
tissues only, one with intestine tissues only, and one comprising two
blood samples. Using the intestine-lung mixed sample we were able to
isolate a novel Betacoronavirus using the Vero-E6 cell line (from ATCC;
Extended Data Fig. 1). The cell line was subjected to species identification
and authentication by microscopic morphologic evaluation and
growth curve analysis, and tested free of mycoplasma contamination.
The cell line was not on the list of commonly misidentified cell
lines maintained by ICLAC. A High Pure Viral RNA Kit (Roche) was used for RNA
extraction on all 43 samples. For RNA sequencing (GX/P2V and GX/P3B),
a sequencing library was constructed using an Ion Total RNA-Seq
Kit v2 (Thermo Fisher Scientific), and the library was subsequently
sequenced using an Ion Torrent S5 sequencer (Thermo Fisher Scientific).
For other samples, reverse transcription was performed using
a SuperScript III First-Strand Synthesis System for RT-PCR (Thermo
Fisher Scientific). DNA libraries were constructed using the NEBNext
Ultra II DNA Library Prep Kit and sequenced on a MiSeq sequencer.
The NGS (next-generation sequencing) QC Toolkit V2.3.3 was used to 
remove low-quality and short reads. Both BLASTn and BLASTx were 
used to search against a local virus database, using the data available 
at NCBI/GenBank. Genome sequences were assembled using the CLC 
Genomic Workbench v.9.0. To fill gaps in high throughput sequencing 
and obtain the whole viral genome sequence, amplicon primers based 
on the bat SARS-like coronavirus ZC45 (GenBank accession number 
MG772933) sequence and the coronavirus contigs obtained in the initial 
sequencing were designed for further amplicon-based sequencing. 
A total of six samples (including the virus isolate) contained reads
that matched members of the genus Betacoronavirus (Extended Data
Table 1). We obtained near-complete viral genomes from these samples
(98%, compared to SARS-CoV-2), which were designated GX/P1E, GX/P2V,
GX/P3B, GX/P4L, GX/P5E and GX/P5L. Their average sequencing
coverage ranged from approximately 8.4× to 8,478× (Extended Data
Fig. 2a-f). On the basis of these genome sequences, we designed primers
for qPCR to confirm the positivity of the original tissue samples
(Extended Data Table 4). This revealed an original lung tissue sample
that was also qPCR positive, in addition to the six original samples with
coronavirus reads. We further tested an additional 19 samples (nine
intestine tissues and ten lung tissues), from 12 smuggled pangolins
sampled between May and July 2018 by the group from Guangxi Medical
University. The genome sequences of GX/P1E, GX/P2V, GX/P3B, GX/P4L,
GX/P5E and GX/P5L have been submitted to the GISAID database and
assigned accession numbers EPI_ISL_410538 to EPI_ISL_410543.
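
To make the read-filtering and coverage steps above concrete, here is a minimal sketch of what removing low-quality and short reads, and computing an average coverage, can look like. It is not the NGS QC Toolkit and does not reproduce its settings: the thresholds (`min_length`, `min_mean_quality`), the Phred+33 encoding and the definition of mean coverage as mapped bases divided by genome length are all assumptions made for illustration.

```python
def mean_phred(quality, offset=33):
    """Mean Phred score of a FASTQ quality string (Phred+33 encoding assumed)."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def filter_reads(records, min_length=50, min_mean_quality=20.0):
    """Keep (name, sequence, quality) records that pass both hypothetical thresholds."""
    return [
        rec for rec in records
        if len(rec[1]) >= min_length and mean_phred(rec[2]) >= min_mean_quality
    ]


def mean_coverage(mapped_read_lengths, genome_length):
    """Average depth, taken here as total mapped bases divided by genome length."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return sum(mapped_read_lengths) / genome_length


# Toy reads (hypothetical): 'I' encodes Phred 40 and '#' encodes Phred 2.
reads = [
    ("read1", "ACGT" * 20, "I" * 80),  # long and high quality: kept
    ("read2", "ACGT" * 5, "I" * 20),   # too short: dropped
    ("read3", "ACGT" * 20, "#" * 80),  # low quality: dropped
]
kept = filter_reads(reads)
print([name for name, _, _ in kept])
print(mean_coverage([len(seq) for _, seq, _ in kept], genome_length=40))
```

With this definition, a genome reported at roughly 8.4× has on average about eight reads covering each position, which is low enough that individual base calls deserve caution.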


Sample collection, viral detection and sequencing of pangolins 
in Guangdong 

After the start of the SARS-CoV-2 outbreak, the Guangzhou Customs
Technology Center re-examined their five archived pangolin samples
(two skin swabs, two unknown tissue samples and one scale) obtained
in anti-smuggling operations undertaken in March 2019. RNA was
extracted from all five samples (Qiagen), and was subjected to
high-throughput RNA sequencing on the Illumina HiSeq platform by
Vision Medicals. The scale sample was found to contain coronavirus
reads using a BLAST-based approach. These reads were quality assessed,
cleaned and assembled into contigs by both de novo (MEGAHIT
v1.1.3) and reference-based (BWA v0.7.13) assembly methods, using
BetaCoV/Wuhan/WIV04/2019 as a reference. The contigs were combined,
and approximately 72% of the coronavirus genome (21,505 bp)
was obtained. This sequence has about 6.6× sequencing coverage
(Extended Data Fig. 2g) and is denoted pangolin-CoV GD/P2S. This
sequence has been deposited on GISAID with accession number
EPI_ISL_410544.

A recently published meta-transcriptomic study of pangolins deposited
21 RNA-seq raw files on the SRA database (https://www.ncbi.nlm.nih.gov/sra).
We screened these raw read files using BLAST methods
and found that five (SRR10168374, SRR10168376, SRR10168377,
SRR10168378 and SRR10168392) contained reads that mapped
to SARS-CoV-2. These reads were subjected to quality assessment,
cleaning and then de novo assembly using MEGAHIT and reference
assembly using BWA. These reads were then merged and curated in a
pileup alignment file to obtain the consensus sequences. This combined
consensus sequence is 25,753 bp in length (about 86.3% of
BetaCoV/Wuhan/WIV04/2019; about 6.9× coverage) and is denoted pangolin-CoV
GD/P1L (available in the Supplementary Information Dataset). Notably,
it has 66.8% overlap and a sequence identity of 99.79% with the GD/P2S
sequence. As the genetic distance between these viruses is very low,
for the recombination analysis we merged the GD/P1L and GD/P2S
sequences into a single consensus sequence to minimize gap regions
within any sequences.
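
The overlap, identity and merging steps described above can be sketched as follows, assuming the two partial genomes have been aligned to a common reference so that they have the same length, with `-` marking uncovered positions. The function names and toy sequences are hypothetical, and writing `N` where the two sequences disagree is an assumption made for this sketch; the text does not state how disagreements were resolved.

```python
def overlap_and_identity(a, b, gap="-"):
    """Return (number of columns covered by both sequences, percent identity over them)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    shared = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not shared:
        return 0, 0.0
    matches = sum(x == y for x, y in shared)
    return len(shared), 100.0 * matches / len(shared)


def merge_consensus(a, b, gap="-", ambiguous="N"):
    """Merge two aligned partial sequences, filling gaps in one from the other."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    merged = []
    for x, y in zip(a, b):
        if x == gap:
            merged.append(y)          # only b covers this column (or neither does)
        elif y == gap or x == y:
            merged.append(x)          # only a covers it, or both agree
        else:
            merged.append(ambiguous)  # both cover it but disagree (assumed handling)
    return "".join(merged)


# Toy aligned fragments (hypothetical, for illustration only).
first = "ACGT--ACGTAC"
second = "--GTTTACGA--"
print(overlap_and_identity(first, second))
print(merge_consensus(first, second))
```

Merging in this way shortens the gap regions of either input, which is the stated purpose of combining GD/P1L and GD/P2S before the recombination analysis.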

The viral genome organizations of the Guangxi and Guangdong pangolin
coronaviruses were similar to that of SARS-CoV-2. They possessed nine
non-overlapping open reading frames (ORFs) plus two overlapping
ORFs, and shared the same gene order of ORF1ab replicase, envelope
glycoprotein spike (S), envelope (E), membrane (M), nucleocapsid (N), 
plus other predicted ORFs. A detailed comparison of the ORF length 
and similarity with SARS-CoV-2 and bat coronavirus RaTG13 is provided 
in Extended Data Table 2. 


Sequence, phylogenetic and recombination analyses 

The human SARS-CoV-2 and bat RaTG13 coronavirus genome sequences
were downloaded from Virological.org (http://virological.org) and the
GISAID (https://www.gisaid.org) databases in January 2020, with the
data kindly shared by the submitters (Extended Data Table 5). Other
coronaviruses (subgenus Sarbecovirus) were downloaded from GenBank
(Extended Data Table 6) and compared to those obtained here. We
constructed a multiple sequence alignment of their complete genomes
and individual genes using MAFFT v.7.273. Maximum likelihood phylogenies
were estimated using RAxML v.8.2.12 from 100 inferences,
using the GTRGAMMA model of nucleotide substitution with 1,000
bootstrap replicates. To investigate potential recombination events, we
used SimPlot v.3.5.1 to conduct a sliding-window analysis to determine
the changing patterns of sequence similarity and phylogenetic clustering
between the query and the reference sequences. A full plot for the
recombination analysis is provided in Extended Data Fig. 3. We also
examined phylogenetic clustering directly from the multiple
sequence alignment. Maximum likelihood trees were estimated from
each window extraction (that is, genome regions 1 to 8) using RAxML
as described above.
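
The idea behind the sliding-window analysis can be illustrated with a short sketch. It is a simplification and not SimPlot's algorithm: it uses raw percent identity rather than a distance model, and the window and step sizes, function names and toy sequences are hypothetical, since the settings used in the study are not stated here. The sketch assumes all sequences come from one multiple alignment and therefore share the same length.

```python
def window_identity(query, reference, start, size, gap="-"):
    """Percent identity within one window, skipping columns where either sequence has a gap."""
    compared = matches = 0
    for x, y in zip(query[start:start + size], reference[start:start + size]):
        if x == gap or y == gap:
            continue
        compared += 1
        matches += x == y
    # A window with no comparable columns is scored 0.0 (no evidence of similarity).
    return 100.0 * matches / compared if compared else 0.0


def sliding_window_similarity(query, references, window=200, step=20):
    """Return [(window midpoint, {reference name: identity})] along the alignment."""
    for name, ref in references.items():
        if len(ref) != len(query):
            raise ValueError(f"{name} is not aligned to the query")
    last_start = max(len(query) - window, 0)
    results = []
    for start in range(0, last_start + 1, step):
        scores = {
            name: window_identity(query, ref, start, window)
            for name, ref in references.items()
        }
        results.append((start + window // 2, scores))
    return results


def closest_reference(scores):
    """Name of the reference with the highest identity in a window."""
    return max(scores, key=scores.get)


# Toy alignment (hypothetical): the query resembles A on the left and B on the right.
query = "AAAAAAAACCCCCCCC"
references = {"A": "AAAAAAAAGGGGGGGG", "B": "TTTTTTTTCCCCCCCC"}
for midpoint, scores in sliding_window_similarity(query, references, window=8, step=8):
    print(midpoint, closest_reference(scores), scores)
```

A change in the closest reference from one window to the next is the kind of pattern that is read as a putative recombination breakpoint, which is then checked with a phylogeny for each region as described above.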


Reporting summary 
Further information on research design is available in the Nature 
Research Reporting Summary linked to this paper. 


Data availability 


Data that support the findings of this study have been deposited in the
GISAID database (https://www.gisaid.org) with accession numbers
EPI_ISL_410538-EPI_ISL_410544 and the SRA database under BioProject
accession number PRJNA606875. The data are also available as
Supplementary Information.

Extended Data Fig. 1 | Microscopy image of the cytopathic effect of the
virus in Vero E6 cells. a, Negative control. Uninfected cells of the Vero E6 cell
line. b, Cytopathic effect seen in viral culture (five days after inoculation).
The experiment was performed twice independently in two laboratories and
produced similar results.

Extended Data Fig. 2 | Read coverage depth of each pangolin coronavirus analysed in this study. a, GX/P1E. b, GX/P2V. c, GX/P3B. d, GX/P4L. e, GX/P5L. f, GX/

Extended Data Fig. 3 | Recombination analysis of all members of the
SARS-CoV-2-related lineages. Sliding window analysis of changes in the
patterns of sequence similarity between human SARS-CoV-2, and pangolin and
bat coronaviruses as described further in Fig. 2a. Sequence similarity patterns
of Bat-CoV (RaTG13) and Guangxi pangolin-CoV are shown in this figure but not
in Fig. 2a.


Extended Data Table 1| High-throughput sequencing results of the pangolin samples with coronavirus reads 


Source location | Sample type | Virus
Guangxi | Intestine | GX/P1E
Guangxi | Lung-intestine mixed (virus isolate) | GX/P2V
Guangxi | Blood | GX/P3B
Guangxi | Lung | GX/P4L
Guangxi | Intestine | GX/P5E
Guangxi | Lung | GX/P5L
Guangdong | Scale | GD/P2S


Sequencing reads have been deposited in the SRA database under BioProject accession number PRJNA606875. 
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Extended Data Table 2 | Genomic comparison of SARS-CoV-2 with bat-CoV RaTG13, Guangdong pangolin-CoV and Guangxi 
pangolin-CoV 


Columns, for each of Bat-CoV RaTG13*, Guangdong pangolin-CoV* and Guangxi pangolin-CoV* in turn:
length/SARS-CoV-2 length (bp), nt identity %, aa identity %

ORF1ab 21287/21290 96.5 98.6 20076*/21290 90.7 97.1 21266/21290 84.9 92.5
S 3810/3822 93.1 97.7 3548*/3822 84.9 90.7 3804/3822 83.6 92.6
ORF3a 828/828 96.3 97.8 828/828 93.6 97.4 828/828 87.0 89.3 
E 228/228 99.6 100 228/228 99.1 100 228/228 97.4 100 
M 666/669 95.9 100 669/669 93.4 98.6 669/669 91.3 98.2 
ORF6 186/186 98.4 100 186/186 Oot 96.6 186/186 90.9 95.0 
ORF7a 366/366 95.6 97.5 366/366 93.4 97.5 366/366 86.6 87.7 
ORF8 366/366 97.0 94.9 366/366 92.3 94.9 366/366 80.6 86.8
N 1260/1260 96.9 99.0 1260/1260 96.2 97.8 1254/1260 91.4 94.3 


*Partial sequence. 
*Wuhan-Hu-1 SARS-CoV-2 (NC_045512.2) was used for comparison with bat-CoV RaTG13 (EPI_ISL_402131), Guangdong pangolin-CoV (merge of GD/P1L and GD/P2S) and Guangxi pangolin-CoV 
(GX/P5L). 


Extended Data Table 3 | Sequence similarity of amino acid sequences of ACE2 between humans, pangolins and bats 


 | Homo sapiens | Manis javanica | Rhinolophus sinicus | Rhinolophus pearsonii | Rhinolophus ferrumequinum
Homo sapiens | 100% | | | |
Manis javanica | 84.85% | 100% | | |
Rhinolophus sinicus | 80.75% | 82.86% | 100% | |
Rhinolophus pearsonii | 81.37% | 82.98% | 94.41% | 100% |

Extended Data Table 4 | Primers used for qPCR detection of pangolin-associated coronaviruses

Primer | Sequence
pCov-Forward | AGGTGACGAGGTTAGACAAATAG
pCov-Reverse | CCAAGCAATAACACAACCAGTAA
pCov-Probe | ACCCGGACAAACTGGTGTTATTGCT


Extended Data Table 5 | Previously obtained SARS-CoV-2 genome sequences 


Accession ID: Virological.org sequence (NC_045512.2)
Virus name: BetaCoV/Wuhan-Hu-1/2019
Location: China / Wuhan
Collection: [illegible]
Originating lab: National Institute for Communicable Disease Control and Prevention (ICDC), Chinese Center for Disease Control and Prevention (China CDC)
Submitting lab: National Institute for Communicable Disease Control and Prevention (ICDC), Chinese Center for Disease Control and Prevention (China CDC)
Authors: Zhang, Y.-Z., Wu, F., Chen, Y.-M., Pei, Y.-Y., Xu, L., Wang, W., Zhao, S., Yu, B., Hu, Y., Tao, Z.-W., Song, Z.-G., Tian, J.-H., Zhang, Y.-L., Liu, Y., Zheng, J.-J., Dai, F.-H., Wang, Q.-M., She, J.-L. and Zhu, T.-Y.

Accession ID: EPI_ISL_402131
Virus name: BetaCoV/bat/Yunnan/RaTG13/2013
Location: China / Yunnan Province / Pu'er City
Collection: 2013-07-24
Originating lab: Wuhan Institute of Virology, Chinese Academy of Sciences
Submitting lab: Wuhan Institute of Virology, Chinese Academy of Sciences
Authors: Yan Zhu, Ping Yu, Bei Li, Ben Hu, Hao-Rui Si, Xing-Lou Yang, Peng Zhou, Zheng-Li Shi

Accession ID: EPI_ISL_402121
Virus name: BetaCoV/Wuhan/IVDC-HB-05/2019
Location: China / Hubei Province / Wuhan City
Collection: 2019-12-30
Originating lab: National Institute for Viral Disease Control and Prevention, China CDC
Submitting lab: National Institute for Viral Disease Control and Prevention, China CDC
Authors: Wenjie Tan, Xuejun Ma, Xiang Zhao, Wenling Wang, Yongzhong Jiang, Roujian Lu, Ji Wang, Peihua Niu, Weimin Zhou, Faxian Zhan, Weifeng Shi, Baoying Huang, Jun Liu, Li Zhao, Yao Meng, Fei Ye, Na Zhu, Xiaozhou He, Peipei Liu, Yang Li, Jing Chen, Wenbo Xu, George F. Gao, Guizhen Wu

Accession ID: EPI_ISL_402120
Virus name: BetaCoV/Wuhan/IVDC-HB-04/2020
Location: China / Hubei Province / Wuhan City
Collection: 2020-01-01
Originating lab: National Institute for Viral Disease Control and Prevention, China CDC
Submitting lab: National Institute for Viral Disease Control and Prevention, China CDC
Authors: Wenjie Tan, Xiang Zhao, Wenling Wang, Xuejun Ma, Yongzhong Jiang, Roujian Lu, Ji Wang, Weimin Zhou, Peihua Niu, Peipei Liu, Faxian Zhan, Weifeng Shi, Baoying Huang, Jun Liu, Li Zhao, Yao Meng, Xiaozhou He, Fei Ye, Na Zhu, Yang Li, Jing Chen, Wenbo Xu, George F. Gao, Guizhen Wu

Accession ID: EPI_ISL_402124
Virus name: BetaCoV/Wuhan/WIV04/2019
Location: China / Hubei Province / Wuhan City
Collection: 2019-12-30
Originating lab: Wuhan Jinyintan Hospital
Submitting lab: Wuhan Institute of Virology, Chinese Academy of Sciences
Authors: Peng Zhou, Xing-Lou Yang, Ding-Yu Zhang, Lei Zhang, Yan Zhu, Hao-Rui Si, Zhengli Shi

Accession ID: EPI_ISL_402123
Virus name: BetaCoV/Wuhan/IPBCAMS-WH-01/2019
Location: China / Hubei Province / Wuhan City
Collection: 2019-12-24
Originating lab: Institute of Pathogen Biology, Chinese Academy of Medical Sciences & Peking Union Medical College
Submitting lab: Institute of Pathogen Biology, Chinese Academy of Medical Sciences & Peking Union Medical College
Authors: Lili Ren, Jianwei Wang, Qi Jin, Zichun Xiang, Zhiqiang Wu, Chao Wu, Yiwei Liu

SARS-CoV-2 genome sequences are available at virological.org and in the GISAID (https://www.gisaid.org) databases. 

Extended Data Table 6 | GenBank accession numbers of coronavirus sequences used in this study

Accession ID | Strain name | Host | Publication
NC_004718.3 | Tor2 | Homo sapiens | He et al. Biochem Biophys Res Commun. 316(2):476-83 (2004); Snijder et al. J. Mol. Biol. 331 (5), 991-1004 (2003); Marra et al. Science 300 (5624), 1399-1404 (2003)
AY313906.1 | GD69 | Homo sapiens | Song et al. Proc. Natl. Acad. Sci. U.S.A. 102(7):2430-5 (2005)
MK211377.1 | BtRs-BetaCoV/YN2018C | Rhinolophus affinis | Han et al. Front Microbiol. 10:1900 (2019)
MK211376.1 | BtRs-BetaCoV/YN2018B | Rhinolophus affinis | Han et al. Front Microbiol. 10:1900 (2019)
MK211374.1 | BtRI-BetaCoV/SC2018 | Rhinolophus sp. | Han et al. Front Microbiol. 10:1900 (2019)
KY352407.1 | BtKY72 | Rhinolophus sp. | Tao et al. Microbiol Resour Announc 8 (28), e00548-19 (2019)
MG772934.1 | bat-SL-CoVZXC21 | Rhinolophus sinicus | Hu et al. Emerg Microbes Infect. 12;7(1):154 (2018)
MG772933.1 | bat-SL-CoVZC45 | Rhinolophus sinicus | Hu et al. Emerg Microbes Infect. 12;7(1):154 (2018)
KY417151.1 | Rs7327 | Rhinolophus sinicus | Hu et al. PLoS Pathog. 13 (11), e1006698 (2017)
KY417147.1 | Rs4237 | Rhinolophus sinicus | Hu et al. PLoS Pathog. 13 (11), e1006698 (2017)
KY417146.1 | Rs4231 | Rhinolophus sinicus | Hu et al. PLoS Pathog. 13 (11), e1006698 (2017)
KY417143.1 | Rs4081 | Rhinolophus sinicus | Hu et al. PLoS Pathog. 13 (11), e1006698 (2017)
KJ473816 | BtRs-YN2013 | Rhinolophus sinicus | Wu et al. J. Infect. Dis. 213 (4), 579-583 (2016); Wu et al. ISME J 10 (3), 609-620 (2016)
KJ473815 | BtRs-GX2013 | Rhinolophus sinicus | Wu et al. J. Infect. Dis. 213 (4), 579-583 (2016); Wu et al. ISME J 10 (3), 609-620 (2016)
KJ473814 | BtRs-HuB2013 | Rhinolophus sinicus | Wu et al. J. Infect. Dis. 213 (4), 579-583 (2016); Wu et al. ISME J 10 (3), 609-620 (2016)
KJ473812 | BtRf-HeB2013 | Rhinolophus ferrumequinum | Wu et al. J. Infect. Dis. 213 (4), 579-583 (2016); Wu et al. ISME J 10 (3), 609-620 (2016)
JX993988 | Cp/Yunnan2011 | Chaerephon plicata | Yang et al. Emerging Infect. Dis. 19 (6) (2013); Wu et al. J. Infect. Dis. 213 (4), 579-583 (2016); Wu et al. ISME J 10 (3), 609-620 (2016)
JX993987 | Rp/Shaanxi2011 | Rhinolophus pusillus | Yang et al. Emerging Infect. Dis. 19 (6) (2013); Wu et al. J. Infect. Dis. 213 (4), 579-583 (2016); Wu et al. ISME J 10 (3), 609-620 (2016)
KU182964.1 | JTMC15 | Rhinolophus ferrumequinum | Xu et al. Virol Sin 31 (1), 69-77 (2016)
KP886808.1 | YNLF_31C | Rhinolophus ferrumequinum | Journal information is not available in the GenBank record
KF569996.1 | LYRa11 | Rhinolophus affinis | He et al. J. Virol. 88 (12), 7070-7082 (2014)
KC881006.1 | Rs3367 | Rhinolophus sinicus | Ge et al. Nature 503, 535-538 (2013)
DQ412043 | Rm1 | Rhinolophus macrotis | Li et al. Science 310 (5748), 676-679 (2005)
DQ412042 | Rf1 | Rhinolophus ferrumequinum | Li et al. Science 310 (5748), 676-679 (2005)
GU190215 | BtCoV/BM48-31/BGR/2008 | Rhinolophus blasii | Drexler et al. J. Virol. 84 (21), 11336-11349 (2010)
GQ153547 | HKU3-12 | Rhinolophus sinicus | Lau et al. J. Virol. 84 (6), 2808-2819 (2010)
GQ153543 | HKU3-8 | Rhinolophus sinicus | Lau et al. J. Virol. 84 (6), 2808-2819 (2010)
GQ153541 | HKU3-6 | Rhinolophus sinicus | Lau et al. J. Virol. 84 (6), 2808-2819 (2010)
FJ588686.1 | Rs672 | Rhinolophus sinicus | Yuan et al. J. Gen. Virol. 91 (PT 4), 1058-1062 (2010)
DQ071615 | Rp3 | Rhinolophus pearsoni | Li et al. Science 310 (5748), 676-679 (2005)
AY304488 | SZ16 | Paguma larvata | Guan et al. Science 302 (5643), 276-278 (2003)
DQ648856 | BtCoV/273/2005 | Rhinolophus ferrumequinum | Tang et al. J. Virol. 80 (15), 7481-7490 (2006)
AY572034 | civet007 | Palm civet (species unspecified) | Wang et al. Emerging Infect. Dis. 11 (12), 1860-1865 (2005)
AY502924 | TW11 | Homo sapiens | Yeh et al. Proc. Natl. Acad. Sci. U.S.A. 101 (8), 2542-2547 (2004)
AY613948 | PC4-13 | Palm civet (species unspecified) | Song et al. Proc. Natl. Acad. Sci. U.S.A. 102 (7), 2430-2435 (2005)
AY613947 | GZ0402 | Homo sapiens | Song et al. Proc. Natl. Acad. Sci. U.S.A. 102 (7), 2430-2435 (2005)
AY559095 | Sin847 | Homo sapiens | Vega et al. BMC Infect. Dis. 4, 32 (2004)
KF294457.1 | Longquan-140 | Rhinolophus monoceros | Journal information is not available in the GenBank record
DQ648857 | BtCoV/279/2005 | Rhinolophus macrotis | Tang et al. J. Virol. 80 (15), 7481-7490 (2006)


Sequences were published previously.

Sample size We screened all relevant pangolin samples that were available to us in the study period. Among the 43 Guangxi pangolin samples (18 pangolin
individuals), 6 samples (5 pangolin individuals) were found to contain SARS-CoV-2-related coronavirus by sequencing. Among the 5 Guangdong
pangolin samples, 1 was found to contain SARS-CoV-2-related coronavirus by sequencing. All these coronaviruses shared >99.7% genomic similarity
either with others in this set or with the coronavirus found in a previous study. Therefore, this sample size is sufficient for the
discovery of SARS-CoV-2-related coronavirus in the pangolins in our conditions.


Data exclusions No data were excluded. 


Replication qPCR was also applied to the same sets of samples that had been examined by metatranscriptomic sequencing, to verify the presence of 
the pangolin coronavirus sequences indicated by sequencing. 
Randomization There was no separation of experimental groups in the study, hence no randomization. 


Blinding There was no separation of experimental groups in the study, hence no blinding. 
Policy information about cell lines 


Cell line source(s) Vero E6 cells from ATCC. 


Authentication All Vero E6 cells were from ATCC with authentication. The authentication was performed by morphology check under 
microscopes and growth curve analysis. 


Mycoplasma contamination We confirm that all cells were tested as mycoplasma negative. 


Commonly misidentified lines No commonly misidentified cell lines were used. 
Wild animals Thirty-five (18+12+5) pangolins were seized during routine anti-smuggling operations and unfortunately died of unknown 
causes at the rescue centre. Samples were then collected from them. 


Field-collected samples No field-collected samples were involved in the study. 


Ethics oversight The animals were rescued and treated by the Guangxi Zhuang Autonomous Region Terrestrial Wildlife Medical-aid and 
Monitoring Epidemic Diseases Research Center under the ethics approval (wild animal treatment regulation No. [2011] 85). The 
samples were collected following the procedure guideline (Pangolins Rescue Procedure, November, 2016). 
The current outbreak of coronavirus disease-2019 (COVID-19) poses unprecedented 
challenges to global health’. The new coronavirus responsible for this outbreak— 
severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)—shares high 
sequence identity to SARS-CoV and a bat coronavirus, RaTG13”. Although bats may be 
the reservoir host for a variety of coronaviruses*’, it remains unknown whether 
SARS-CoV-2 has additional host species. Here we show that a coronavirus, which we 
name pangolin-CoV, isolated from a Malayan pangolin has 100%, 98.6%, 97.8% and 
90.7% amino acid identity with SARS-CoV-2 in the E, M, N and S proteins, respectively. 
In particular, the receptor-binding domain of the S protein of pangolin-CoV is almost 
identical to that of SARS-CoV-2, with one difference in a noncritical amino acid. Our 
comparative genomic analysis suggests that SARS-CoV-2 may have originated in the 
recombination of a virus similar to pangolin-CoV with one similar to RaTG13. 
Pangolin-CoV was detected in 17 out of the 25 Malayan pangolins that we analysed. 
Infected pangolins showed clinical signs and histological changes, and circulating 
antibodies against pangolin-CoV reacted with the S protein of SARS-CoV-2. The 
isolation of a coronavirus from pangolins that is closely related to SARS-CoV-2 
suggests that these animals have the potential to act as an intermediate host of 
SARS-CoV-2. This newly identified coronavirus from pangolins—the most-trafficked 
mammal in the illegal wildlife trade—could represent a future threat to public health if 
wildlife trade is not effectively controlled. 


As coronaviruses are common in mammals and birds’, we used the 
whole-genome sequence of SARS-CoV-2 (strain WHCV; GenBank accession 
number MN908947) in a BLAST search of SARS-related coronavirus 
sequences in available mammalian and avian viromic, metagenomic 
and transcriptomic data. We identified 34 closely related contigs in a 
set of viral metagenomes from pangolins (Extended Data Table 1), and 
therefore focused our subsequent search on SARS-related coronaviruses 
in pangolins. 

We obtained the lung tissues from 4 Chinese pangolins (Manis 
pentadactyla) and 25 Malayan pangolins (Manis javanica) from a wildlife 
rescue centre during March–August 2019, and analysed them for 
SARS-related coronaviruses using reverse-transcription polymerase 
chain reaction (RT-PCR) with primers that target a conserved region 
of betacoronaviruses. RNA from 17 of the 25 Malayan pangolins gener- 
ated the expected PCR product, whereas RNA from the Chinese pango- 
lins did not amplify. The virus-positive Malayan pangolins were all from 
the first transport. These pangolins were brought into the rescue centre 
at the end of March, and gradually showed signs of respiratory disease, 
including shortness of breath, emaciation, lack of appetite, inactivity 


and crying. Furthermore, 14 of the 17 pangolins that tested positive for 
viral RNA died within one and a half months of testing. Plasma samples 
of four PCR-positive and four PCR-negative Malayan pangolins were 
used in the detection of IgG and IgM antibodies against SARS-CoV-2 
using a double-antigen sandwich enzyme-linked immunosorbent assay 
(ELISA). One of the PCR-positive samples reacted strongly, showing an 
optical density at 450 nm (OD450) value of 2.17 (cut-off value = 0.11) 
(Extended Data Table 2). The plasma remained positive at the dilution 
of 1:80, which suggests that the pangolin was naturally infected with a 
virus similar to SARS-CoV-2. The other three PCR-positive pangolins had 
no detectable antibodies against SARS-CoV-2. It is possible that these 
pangolins died during the acute stage of disease, before the appear- 
ance of antibodies. Histological examinations of tissues from four 
betacoronavirus-positive Malayan pangolins revealed diffuse alveolar 
damage of varying severity in the lung, compared with lung tissue from 
a betacoronavirus-negative Malayan pangolin. In one case, alveoli were 
filled with desquamated epithelial cells and some macrophages with 
haemosiderin pigments, with considerably reduced alveolar space, 
leading to the consolidation of the lung. In other cases, similar changes 
Fig. 1 | Pathological changes in the lungs of pangolins that are potentially 
induced by pangolin-CoV. a–d, Histological changes in the lung tissues are 
compared between a virus-negative Malayan pangolin (a) and three 
Malayan pangolins naturally infected with pangolin-CoV (b–d) (original 
magnification × 1,000). Proliferation and desquamation of alveolar epithelial 
cells and haemosiderin pigments are seen in tissues from all three infected 
pangolins and severe capillary congestion is seen in one of them (c). e, Viral 
particles are seen in double-membrane vesicles in the transmission electron 
microscopy image taken from Vero E6 cell culture inoculated with supernatant 
of homogenized lung tissue from one pangolin, with morphology indicative of 
coronavirus (insets at the top right corner of e). Scale bar, 200 nm. 


were more focal (Fig. 1, Extended Data Fig. 1). The severe case also had 
exudate with red blood cells and necrotic cell debris in bronchioles and 
bronchi. Focal mononuclear-cell infiltration was seen in the bronchi- 
oles and bronchi in two of the cases, and haemorrhage was seen in the 
bronchioles and small bronchi in one case (Extended Data Figs. 1-3). 
Hyaline membrane and syncytia were not detected in the alveoli of the 
four cases we examined. 

To isolate the virus, supernatant from homogenized lung tissue from 
one dead Malayan pangolin was inoculated into Vero E6 cells. Obvious 
cytopathogenic effects were observed in cells after a 72-h incubation. 
Viral particles were detected by transmission electron microscopy: 
most of these particles were inside double-membrane vesicles, with 
a few outside of them. They showed typical coronavirus morphology 
(Fig. 1e). RT-PCR targeting the spike (S) and RdRp genes produced the 
expected PCR products: these PCR products had approximately 84.5% 
and 92.2% nucleotide sequence identity, respectively, to the partial 
S and RdRp genes of SARS-CoV-2. 

Illumina RNA sequencing was used to identify viruses in the 
lung from nine pangolins. Mapping sequence data to the reference 
SARS-CoV-2 WHCV genome identified coronavirus sequence reads 
in seven samples (Extended Data Table 3). For one sample, higher 
genome coverage was obtained by remapping the total reads to the 
reference genome (Extended Data Fig. 4). We obtained the completed 
coronavirus genome (29,825 bp)—which we designated pangolin-CoV— 
using the assembled contigs, short sequence reads and targeted PCR 
analysis. The full S gene was sequenced in six PCR-positive samples, 
which revealed the presence of only four nucleotide differences in the 
sequence alignment among these samples (Extended Data Fig. 5); this 
indicates that only one type of coronavirus was present in the batch of 
study samples. The predicted S, E, M and N genes of pangolin-CoV are 
3,798, 228, 669 and 1,260 bp, respectively, in length and the proteins 
they encode share 90.7%, 100%, 98.6% and 97.8% amino acid identity 
to the equivalent proteins of SARS-CoV-2 (Table 1). 

In a Simplot analysis of whole-genome sequences, we found that 
pangolin-CoV was highly similar to SARS-CoV-2 and RaTG13, with 
sequence identity between 80 and 98% (except for the S gene) (Fig. 2). 
Further comparative analysis of the S gene sequences suggests that 
there were recombination events among some of the SARS-related 
coronaviruses that we analysed. In the region of nucleotides 1-914, 
pangolin-CoV is more similar to the bat SARS-related coronaviruses 
ZXC21 and ZC45, whereas in the remaining part of the gene pangolin-CoV 
is more similar to SARS-CoV-2 and RaTG13 (Fig. 2). In particular, the 
receptor-binding domain (RBD) of the S protein of pangolin-CoV has 
only one amino acid difference with SARS-CoV-2. Overall, these data 
indicate that SARS-CoV-2 might have originated from the recombina- 
tion of a virus similar to pangolin-CoV and a virus similar to RaTG13 
(Fig. 2). To further support this conclusion, we assessed the evolu- 
tionary relationships among betacoronaviruses in the full genome, 
the RdRp and S genes, and in different regions of the S gene (Fig. 2c, 
Extended Data Fig. 6). The topologies mostly showed the clustering of 
pangolin-CoV with SARS-CoV-2 and RaTG13; SARS-CoV-2 and RaTG13 
form a subclade within this cluster (Fig. 2c). However, pangolin-CoV 
and SARS-CoV-2 grouped together in the phylogenetic analysis of the 
RBD. Conflicts in cluster formation among phylogenetic analyses of 
different regions of the genome serve as a strong indication of genetic 
recombination, as has previously been seen for SARS-CoV and Middle 
East respiratory syndrome coronavirus (MERS-CoV)°’. 
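
The similarity-plot idea used above can be illustrated with a minimal Python sketch. It slides a window along two aligned sequences and reports the percentage of identical positions in each window. This is not the Simplot implementation: Simplot was run with a Kimura 2-parameter model, whereas the sketch uses plain percentage identity, and the function name `sliding_identity` and the toy sequences are hypothetical. A drop in identity confined to one region, as reported for the 5' end of the S gene, is the kind of signal that points to a possible recombination breakpoint.

```python
def sliding_identity(query, reference, window=500, step=50):
    """Percentage identity of two aligned sequences in sliding windows.

    Returns (window midpoint, percent identity) pairs. Columns with a gap
    in either sequence are skipped, loosely mimicking a 'gap strip' option.
    """
    if len(query) != len(reference):
        raise ValueError("sequences must be aligned to the same length")
    points = []
    for start in range(0, len(query) - window + 1, step):
        pairs = [
            (a, b)
            for a, b in zip(query[start:start + window],
                            reference[start:start + window])
            if a != "-" and b != "-"
        ]
        if not pairs:  # window consists only of gapped columns
            continue
        matches = sum(a == b for a, b in pairs)
        points.append((start + window // 2, 100.0 * matches / len(pairs)))
    return points


# Toy alignment (hypothetical): identical first half, divergent second half.
toy_points = sliding_identity("A" * 10 + "C" * 10, "A" * 20, window=10, step=5)
```

With the window and step reported for Fig. 2 (500 bp and 50 bp), the defaults can be left unchanged; the toy call uses smaller values only so that the result is easy to check by hand.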

As the S proteins of both SARS-CoV and SARS-CoV-2 have previously 
been shown to specifically recognize angiotensin-converting enzyme 2 
(ACE2) during the entry of host cells”*, we conducted molecular bind- 
ing simulations of the interaction of the S proteins of the four closely 
related SARS-related coronaviruses with ACE2 proteins from humans, 
civets and pangolins. As expected, the RBD of SARS-CoV binds effi- 
ciently to ACE2 from humans and civets in the molecular binding simu- 
lation. In addition, this RBD appears to be capable of binding ACE2 of 
pangolins. By contrast, the S proteins of SARS-CoV-2 and pangolin-CoV 
can potentially recognize only the ACE2 of humans and pangolins 
(Extended Data Fig. 7). 

SARS-CoV-2 is one of three known zoonotic coronaviruses (the others 
are SARS-CoV and MERS-CoV) that infect the lower respiratory tract and 
cause severe respiratory syndromes in humans”. Thus far, SARS-CoV-2 
has been more contagious, but less deadly, than SARS-CoV": the total 
number of human infections by SARS-CoV-2 far exceeds those of 
SARS-CoV”. Epidemiological investigations of the SARS-CoV-2 outbreak 
have shown that some of the initial patients were associated with the 
Huanan seafood market, where live wildlife was also sold’®. No animals 
thus far have been implicated as carriers of the virus. SARS-CoV-2 forms 
a cluster with SARS-CoV and bat SARS-related coronaviruses (Fig. 2c). In 
addition, a bat coronavirus (RaTG13) has about 96% sequence identity 
to SARS-CoV-2 at the whole-genome level’. Therefore, it is reasonable 
to assume that bats are the native host of SARS-CoV-2, as has previously 
been suggested for SARS-CoV and MERS-CoV””. The SARS-related 
coronavirus identified in the present study and the metagenomic 


Table 1 | Genomic comparison of pangolin-CoV with SARS-CoV-2, SARS-CoV and bat SARS-related coronaviruses 

                 S             E             M             N             Full-length genome 
WHCV             84.5 (90.7)   99.1 (100)    93.2 (98.6)   96.1 (97.8)   90.1 
SARS-CoV GD01    72.2 (77.2)   93.5 (93.5)   85.8 (90.0)   87.5 (90.0)   81.6 
RaTG13           88.5 (89.8)   99.6 (100)    93.6 (99.1)   94.0 (96.7)   88.9 
ZC45             83.1 (86.1)   98.7 (100)    94.2 (99.6)   88.9 (93.3)   88.0 
ZXC21            81.1 (85.4)   98.7 (100)    94.2 (99.6)   88.9 (93.3)   88.4 

Numbers represent the percentage of nucleotides shared; numbers in parentheses represent percentage of amino acids shared. 
Fig. 2 | Genome characterization of pangolin-CoV. a, Similarity plot of the 
full-length genomes and S gene sequences of pangolin-CoV against sequences 
of SARS-CoV-2 strain WIV02, as well as RaTG13, ZC45 and ZXC21. Although 
pangolin-CoV has a high sequence identity to SARS-CoV-2 and RaTG13 in most 
regions of the S gene, it is more similar to ZXC21 and ZC45 at the 5’ end. 
SARS-rCoV, SARS-related coronavirus. Parameters for the similarity plots are: 
window, 500 bp; step, 50 bp; gap strip, on; Kimura (2 parameter); T/t 2.0. 
b, Because of the presence of genetic recombination, there is discrepancy in 
cluster formation among the outcomes of phylogenetic analyses of different 


assemblies of viral sequences from Malayan pangolins"“ is genetically 
related to SARS-CoV-2, but is unlikely to be directly linked to the cur- 
rent outbreak because of its substantial sequence differences from 
SARS-CoV-2. However, a virus related to pangolin-CoV appears to have 
donated the RBD to SARS-CoV-2. SARS-related coronavirus sequences 
have previously been detected in dead Malayan pangolins®. These 
sequences appear to be from the same virus (pangolin-CoV) that we 
identified in the present study, as judged from their sequence similar- 
ity. Here we provide evidence for the potential for pangolins to act as 
the zoonotic reservoir of SARS-CoV-2-like coronaviruses. However, 
the pangolins we studied here showed clinical signs of disease. In gen- 
eral, a natural reservoir host does not show severe disease, whereas 
an intermediate host may have clinical signs of infection’®. Although 
a SARS-CoV-2-like coronavirus was detected in the lungs of these 
regions of the S gene. c, Phylogeny of coronaviruses closely related to 
SARS-CoV-2, based on full genome sequences. The phylogenetic tree was 
constructed using RAxML with the substitution model GTRGAMMAI and 
1,000 bootstrap replicates. Numbers (>70) above or below branches are 
percentage bootstrap values for the associated nodes. The scale bar represents 
the number of substitutions per site. Red circles indicate the pangolin 
coronavirus sequences generated in this study, and blue triangles indicate 
SARS-CoV-2 sequences from humans. 


pangolins, a direct association between the clinical signs or pathol- 
ogy and active virus replication is not available as we lack evidence 
from immunohistochemistry or in situ hybridization experiments. 
The experimental infection of healthy pangolins with pangolin-CoV 
would provide more definitive answers; however, as pangolins are 
protected it is difficult to carry out such experiments. Further stud- 
ies are needed to confirm the role of pangolins in the transmission of 
SARS-related coronaviruses. 

As the RBD of pangolin-CoV is nearly identical to that of SARS-CoV-2, 
the virus in pangolins presents a potential future threat to public health. 
Pangolins and bats are both nocturnal animals, eat insects and share 
overlapping ecological niches’”"’, which make pangolins an ideal inter- 
mediate host for some SARS-related coronaviruses. Therefore, more 
systematic and long-term monitoring of SARS-related coronaviruses 


in pangolins and related animals should be implemented to identify 
the potential animal source of SARS-CoV-2 in the current outbreak. 

Our findings support the call for stronger enforcement of regulations 
against the illegal trade in pangolins. Owing to the demand for their 
meat as a delicacy and their scales for use in traditional medicine in 
China, the illegal smuggling of pangolins from Southeast Asia to China 
is widespread’. International co-operation in the implementation of 
stricter regulations against illegal wildlife trade and consumption of 
game meat should be encouraged, as this will increase the protection 
of endangered animals and help to prevent future outbreaks of diseases 
caused by SARS-related coronaviruses. 
Methods 


No statistical methods were used to predetermine sample size. 


Metagenomic analysis and viral genome assembly 

We collected viromic, metagenomic and transcriptomic data of different 
mammals and birds in public databases—including NCBI Sequence Read 
Archive (SRA) and European Nucleotide Archive (ENA)—for searching 
potential coronavirus sequences. The raw reads from the public data- 
bases and some in-house metagenomic datasets were trimmed using 
fastp (v.0.19.7)"’ to remove adaptor and low-quality sequences. The clean 
reads were mapped to the SARS-CoV-2 reference sequence (MN908947) 
using BWA-MEM (v.0.7.17)”° with >30% matches. The mapped reads were 
collected for downstream analyses. Contigs were de novo-assembled 
using Megahit (v.1.0.3)” and identified as related to SARS-CoV-2 using 
BLASTn with E-values <1 × 10⁻⁵ and sequence identity >90%. 
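
The final filtering step of this pipeline can be sketched in Python as follows. The sketch assumes that the BLASTn hits were written in the standard 12-column tabular format (`-outfmt 6`), in which column 3 is the percentage identity and column 11 is the E-value. The function name `filter_blast_hits`, the default E-value threshold of 1e-5 and the example rows are assumptions made for illustration; only the identity and length of `con_9` are taken from Extended Data Table 1, and the remaining fields in the example rows are invented.

```python
import csv
import io


def filter_blast_hits(tabular_text, min_identity=90.0, max_evalue=1e-5):
    """Keep BLAST tabular (-outfmt 6) hits that pass both thresholds.

    Returns (query id, percent identity, alignment length, E-value) tuples.
    """
    hits = []
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        identity = float(row[2])   # column 3: percent identity
        length = int(row[3])       # column 4: alignment length
        evalue = float(row[10])    # column 11: E-value
        if identity > min_identity and evalue < max_evalue:
            hits.append((row[0], identity, length, evalue))
    return hits


# Two illustrative rows; coordinates, mismatch counts and bit scores are made up.
example = (
    "con_9\tMN908947\t98.623\t363\t5\t0\t1\t363\t100\t462\t0\t640\n"
    "con_x\tMN908947\t85.000\t200\t30\t0\t1\t200\t500\t699\t1e-40\t150\n"
)
kept = filter_blast_hits(example)
```

Here `con_x` is a hypothetical contig that fails the identity threshold despite a very small E-value, which shows why both criteria are applied together.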


Samples 

Pangolins used in the study were confiscated by Customs and Depart- 
ment of Forestry of Guangdong Province in March and August 2019. 
They included four Chinese pangolins (M. pentadactyla) and 25 Malayan 
pangolins (M. javanica). The first transport confiscated contained 
21 Malayan pangolins, and the second transport contained 4 Malayan 
pangolins and 4 Chinese pangolins. These pangolins were sent to the 
wildlife rescue centre, and were mostly inactive and crying, and eventu- 
ally died in custody despite exhaustive rescue efforts. Tissue samples 
were taken from the lung of pangolins that had just died for histological 
and virological examinations. 


Pathological examinations 

Histological examinations were performed on lung tissues from five 
Malayan pangolins. In brief, the tissues collected were cut into small 
pieces and fixed in 10% buffered formalin for 24 h. They were washed 
free of formalin, dehydrated in ascending grades of ethanol and cleared 
with chloroform, and then embedded with molten paraffin wax in a 
template. The tissue blocks were sectioned with a microtome. The 
sections were transferred onto grease-free glass slides, deparaffinized 
and rehydrated through descending grades of ethanol and distilled 
water. They were stained with a haematoxylin and eosin staining kit 
(Baso Diagnostics, Wuhan Servicebio Technology). Finally, the stained 
slides were mounted with coverslips and examined under an Olympus 
BX53 equipped with an Olympus PM-C 35 camera. 


Virus isolation and RT-PCR analysis 
Lung tissue extract from pangolins was inoculated into Vero E6 cells for 
virus isolation. The cell line was tested free of mycoplasma contamina- 
tion using LookOut Mycoplasma PCR Detection Kit (SIGMA), and was 
authenticated by microscopic morphologic evaluation. Cultured cell 
monolayers were maintained in Dulbecco’s Modified Eagle Medium 
(DMEM) and Ham’s F-12. The inoculum was prepared by grinding the 
lung tissue in liquid nitrogen, diluting it 1:2 with DMEM, filtering it 
through a 0.45-μm filter (Merck Millipore), and treating it with 16 μg/ml 
trypsin solution. After incubation at 37 °C for 1h, the inoculum was 
removed from the culture and replaced with fresh culture medium. The 
cells were incubated at 37 °C and observed daily for cytopathic effects. 
Viral RNA was extracted from the lung tissue using the QIAamp Viral 
RNA Mini kit (Qiagen) following the manufacturer-recommended 
procedures, and examined for coronavirus by RT-PCR using 
a pair of primers (F: 5’-TGGCWTATAGGTTYAATGGYATTGGAG-3’, 
R: 5’-CCGTCGATTGTGTGWATTTGSACAT-3’) designed to amplify the 
S gene of betacoronavirus. 


Transmission electron microscopy 
Cell cultures that showed cytopathic effects were examined for the viral 
particles using transmission electron microscopy. Cells were collected 


from the culture by centrifugation at 1,000g for 10 min, and fixed ini- 
tially with 2.5% glutaraldehyde solution at 4 °C for 4 h, and again with 
1% osmium tetroxide. They were dehydrated with graded ethanol and 
embedded with PON812 resin. Sections (80 nm in thickness) were cut 
from the resin block and stained with uranyl acetate and lead citrate 
sequentially. The negative stained grids and ultrathin sections were 
observed under a HT7800 transmission electron microscope (Hitachi). 


Serological test 

Plasma samples from eight Malayan pangolins were tested for 
anti-SARS-CoV-2 antibodies using a double-antigen ELISA kit for the 
detection of antibodies against SARS-CoV-2 by Hotgen, following 
manufacturer-recommended procedures. The assay was designed 
for the detection of both IgG and IgM antibodies against SARS-CoV-2 
in humans and animals, and marketed as a supplementary diagnostic 
tool for COVID-19. It uses the capture of antibodies against SARS-CoV-2 
by the S1 antigen precoated on ELISA plates, and the detection of the 
antibodies through the use of horseradish peroxidase-conjugated RBD. 
Both the S1 antigen and RBD fragment were expressed in eukaryotic 
cells. Data generated by the test developer have shown a 95% detection 
rate in the analysis of sera from over 200 patients with COVID-19. 
The assay has an inter-test variation of <15%, and no cross-reactivities 
with sera or plasma from patients positive for SARS-CoV, common and 
avian influenza viruses, mycoplasma and chlamydia. Fifty microlitres 
of plasma was analysed in duplicate, together with two negative con- 
trols and one positive control. The reaction was read on a Synergy HTX 
Multi-Mode Microplate Reader (BioTek) at 450/630 nm, with optical 
density (OD) values being calculated. The cut-off OD value for positivity 
was 0.105 + mean OD from the negative controls, and the cut-off value 
for OD for the positive control was set at > 0.5. Positive samples were 
tested again with serial-diluted plasma. 
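
The positivity rule described above can be written out as a short Python sketch. The averaged OD values are those listed in Extended Data Table 2, and the cut-off is 0.105 plus the mean OD of the two negative controls. The function names are hypothetical, and treating a sample as positive only when its OD is strictly greater than the cut-off is an assumption, because the text does not say how ties are handled.

```python
def elisa_cutoff(negative_control_ods, offset=0.105):
    """Cut-off OD for positivity: offset + mean OD of the negative controls."""
    return offset + sum(negative_control_ods) / len(negative_control_ods)


def classify_samples(sample_ods, cutoff):
    """Map each sample ID to True (positive) or False (negative)."""
    return {sample: od > cutoff for sample, od in sample_ods.items()}


# Average OD per plasma sample, as given in Extended Data Table 2.
average_od = {
    1: 2.1705, 2: 0.013, 3: 0.0125, 4: 0.024,
    5: 0.011, 6: 0.011, 7: 0.028, 8: 0.0525,
}
cutoff = elisa_cutoff([0.01, 0.01])          # two negative controls
calls = classify_samples(average_od, cutoff)
positives = [sample for sample, is_positive in calls.items() if is_positive]
```

With these inputs the cut-off is 0.115, in line with the rounded value of 0.11 quoted in the main text, and only sample 1 exceeds it.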


Metagenomic sequencing 

The lung tissue was homogenized by vortex with silica beads in 1 ml 
of phosphate-buffered saline. The homogenate was centrifuged at 
10,000g for 5 min, with the supernatant being filtered through a 
0.45-μm filter (Merck Millipore) to remove large particles. The filtrate or 
virus culture supernatant was used in RNA extraction with the QIAamp 
Viral RNA Mini kit. cDNA was synthesized from the extracted RNA using 
PrimeScript II reverse transcriptase (Takara) and random primers, 
and amplified using Klenow Fragment (New England Biolabs). Sequenc- 
ing libraries were prepared with NEBNext Ultra DNA Library Prep Kit for 
Illumina (New England Biolabs), and sequenced paired-end (150-bp) 
on an Illumina NovaSeq 6000. Specific PCR assays were used to fill 
genome sequence gaps, using primers designed based on sequences 
flanking the gap. 


Phylogenetic analysis 

Multiple sequence alignments of all sequence data were constructed 
using MAFFT v.7.221”. The phylogenetic relationship of the viral 
sequences was assessed using RAxML v.8.0.14”. The best-fit evolutionary 
model for the sequences in each dataset was identified using 
ModelTest™. Potential recombination events and the location of possible 
breakpoints in betacoronavirus genomes were detected using 
Simplot (version 3.5.1)” and RDP v.4.99. 


Molecular simulation of interactions between RBD and ACE2 

The interaction between the RBD of the S protein of SARS-related 
coronavirus and the ACE2 of humans, civets, and pangolins was exam- 
ined using molecular dynamic simulation. The crystal structure of 
SARS-CoV RBD domain binding to human ACE2 protein complex was 
downloaded from Protein Data Bank (PDB code 2AJF”’). The structures 
of the complexes formed by ACE2 of civets or pangolins and the RBD of 
SARS-CoV-2, RaTG13 and pangolin-CoV were made using the MODELLER 
program, and superimposed with the template (PDB code 2AJF). 


The sequence identity of SARS-CoV RBD (PDB code 6ACD) to the RBD 
of SARS-CoV-2, RaTG13 and pangolin-CoV was 76.5%, 76.8% and 74.2%, 
respectively, and the sequence identity of the human ACE2 protein to 
that of pangolins and civets was 85.4% and 86.9%, respectively. 

The molecular dynamic simulations of RBD-ACE2 complexes were 
carried out using the AMBER 18 suite” and ff14SB force field*°. After 
two-stage minimization, NVT and NPT-MD, a 30-ns production molecular 
dynamics simulation was applied, with the time step being set to 2 
fs and coordinate trajectories being saved every 3 ps. The MM-GBSA”" 
approach was used to calculate the binding free energy of each ACE2 
protein to the RBD of the S protein, using the Python script MMPBSA.py” 
in the built-in procedure of the AMBER 18 suite. The last 300 frames 
of all simulations were extracted to calculate the binding free energy 
that excludes the contributions of disulfide bond. 
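
The simulation settings above imply a fixed number of integration steps and saved frames, which the following Python sketch works out. It is plain arithmetic on the reported values (30 ns of production, a 2-fs time step, coordinates saved every 3 ps, and the last 300 frames used for MM-GBSA); the function name `md_schedule` is hypothetical and nothing here calls AMBER itself.

```python
def md_schedule(production_ns=30, timestep_fs=2, save_every_ps=3,
                frames_analysed=300):
    """Derive step and frame counts from the reported MD settings."""
    total_steps = production_ns * 1_000_000 // timestep_fs   # 1 ns = 1,000,000 fs
    steps_per_frame = save_every_ps * 1_000 // timestep_fs   # 1 ps = 1,000 fs
    total_frames = production_ns * 1_000 // save_every_ps    # 1 ns = 1,000 ps
    analysed_ns = frames_analysed * save_every_ps / 1_000    # length of analysed tail
    return {
        "total_steps": total_steps,
        "steps_per_frame": steps_per_frame,
        "total_frames": total_frames,
        "analysed_ns": analysed_ns,
    }


schedule = md_schedule()
```

Under these settings the production run comprises 15,000,000 steps and 10,000 saved frames, so the 300 frames used for the binding free energy cover only the final 0.9 ns of each trajectory.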


Reporting summary 
Further information on research design is available in the Nature 
Research Reporting Summary linked to this paper. 
Extended Data Fig. 1 | Pathological changes in the lungs of pangolins. 
a–d, Lungs of three Malayan pangolins naturally infected with pangolin-CoV 
(b–d, original amplification × 100) in comparison with the lung from a 
virus-negative Malayan pangolin (a). Different degrees of consolidation are 
seen in the lung tissues from three infected pangolins (b–d). Exudate is seen in 
the bronchi of one infected pangolin (b). Severe congestion is seen in the lung 
of one pangolin (d). 


Extended Data Fig. 2 | Pathological changes in the bronchiole of pangolins. 
a–d, Three Malayan pangolins positive for pangolin-CoV (b–d, original 
amplification × 100) in comparison with a virus-negative Malayan pangolin (a). 
Red blood cells are seen in the bronchioles of two infected pangolins (b, d). 
Mononuclear cell infiltration is seen in the bronchiole wall of one infected 
pangolin (c). Severe congestion is seen in the alveolar tissue (in close proximity 
to the bronchiole) of one pangolin (d). The respiratory epithelium in the 
bronchioles is intact. 
Extended Data Fig. 3 | Pathological changes in the bronchus of pangolins. 
a–d, Three Malayan pangolins positive for pangolin-CoV (b–d, original 
amplification × 100) in comparison with a virus-negative Malayan pangolin (a). 
Exudate with red blood cells is seen in the bronchus of one infected pangolin (b). 
Macrophages with haemosiderin pigments and mononuclear cell 
infiltration are seen in the bronchus wall of two infected pangolins 
(c and d, respectively). The respiratory epithelium in the bronchi is intact. 
Extended Data Fig. 4 | Mapping raw reads from pangolin lung to pangolin-CoV genome. Results of the mapping of raw reads from the high-throughput 
sequencing of the pangolin lung tissue to the assembled pangolin-CoV genome. 
Extended Data Fig. 5 | Sequence polymorphism among nucleotide sequences of the full S gene among virus-positive lung samples from six pangolins. Dots 
denote nucleotide identity to the reference sequence. 
Extended Data Fig. 6 | Phylogeny of coronaviruses closely related to 
SARS-CoV-2. a, Based on nucleotide sequences of the S gene. b, Based on 
nucleotide sequences of the RdRp gene. The phylogenetic trees were 
constructed by RAxML with the substitution model GTRGAMMAI and 
1,000 bootstrap replicates. Numbers (>70) above or below branches are 
percentage bootstrap values for the associated nodes. The scale bar represents 
the number of substitutions per site. Red circles indicate the pangolin 
coronavirus sequences generated in this study, and blue triangles indicate 
SARS-CoV-2 sequences from humans. 
Extended Data Fig. 7 | Molecular binding simulations of the interaction of 
the S proteins of four closely related SARS-related coronaviruses with the 
ACE2 proteins of humans, civets and pangolins. a, Free energy (kcal mol⁻¹) 
for the binding of the RBD of S proteins of four SARS-related coronaviruses to 
the ACE2 of potential hosts. b, Alignment of the RBD sequences (key amino 
acids involved in interactions with ACE2 are boxed) of the S proteins from 
several genetically related SARS-related coronaviruses. c, Alignment of partial 
ACE2 amino acid sequences (key amino acids involved in interactions with RBD 
are marked with arrowheads) from humans, pangolins and civets at their 
interface with the RBD of S proteins. 


Extended Data Table 1 | Results of Blast search of SARS-related coronavirus sequences in available mammalian and avian 
viromic, metagenomic and transcriptomic data using the SARS-CoV-2 sequence (GenBank accession number MN908947) 

Contig name   Sequence identity (%)   Length (bp)   E-value      Data source 
con_9         98.623                  363           0            PRJNA573298 
con_15        97.303                  519           0            PRJNA573298 
con_2         96.97                   203           2.41E-44     PRJNA573298 
con_19        96.855                  477           0            PRJNA573298 
con_10        96.562                  349           4.64E-168    PRJNA573298 
con_1         95.96                   203           1.12E-42     PRJNA573298 
con_13        95.918                  373           7.65E-42     PRJNA573298 
con_11        95.804                  402           5.63E-133    PRJNA573298 
con_4         94.737                  231           1.00E-38     PRJNA573298 
con_25        94.737                  779           0            PRJNA573298 
con_22        94.297                  655           5.73E-115    PRJNA573298 
con_7         94.098                  305           1.18E-133    PRJNA573298 
con_16        94.041                  387           3.08E-170    PRJNA573298 
con_21        93.971                  680           0            PRJNA573298 
con_6         93.96                   304           9.13E-130    PRJNA573298 
con_24        93.74                   610           0            PRJNA573298 
con_20        93.343                  721           0            PRJNA573298 
con_12        93.333                  373           4.14E-114    PRJNA573298 
con_30        93.321                  1048          0            PRJNA573298 
con_29        92.892                  886           0            PRJNA573298 
con_5         92.632                  231           2.17E-35     PRJNA573298 
con_31        92.495                  635           0            PRJNA573298 
con_23        92.354                  669           0            PRJNA573298 
con_32        91.942                  1031          0            PRJNA573298 
con_34        91.884                  1687          0            PRJNA573298 
con_27        91.844                  846           0            PRJNA573298 
con_8         91.776                  304           1.99E-121    PRJNA573298 
con_3         91.705                  218           6.83E-85     PRJNA573298 
con_14        91.436                  410           5.55E-158    PRJNA573298 
con_18        91.429                  385           5.23E-153    PRJNA573298 
con_33        91.358                  1177          7.06E-27     PRJNA573298 
con_17        91.02                   491           0            PRJNA573298 
con_26        90.921                  740           0            PRJNA573298 
con_28        90.814                  840           0            PRJNA573298 
Extended Data Table 2 | OD values (at 450 and 630 nm) of 
ELISA testing of SARS-CoV-2 antibodies in plasma samples 
from eight pangolins 

Sample             Repetition 1   Repetition 2   Average 
1                  2.253          2.088          2.1705 
2                  0.014          0.012          0.013 
3                  0.013          0.012          0.0125 
4                  0.023          0.025          0.024 
5                  0.01           0.012          0.011 
6                  0.011          0.011          0.011 
7                  0.028          0.028          0.028 
8                  0.053          0.052          0.0525 
Negative control   0.01           0.01           0.01 


Extended Data Table 3 | Identification of SARS-related 
coronavirus sequence reads in metagenomes from the lung 
of pangolins using the SARS-CoV-2 sequence (GenBank 
accession number MN908947) as the reference 


Sample ID Animal species Total reads* No. mapped 
M1 Malayan pangolin 107,267,359 496 
M2 Malayan pangolin 38,091,846 302 
M3 Malayan pangolin 79,477,358 14 
M4 Malayan pangolin 32,829,850 1,100 
M5 Malayan pangolin 547,302,862 56 
M6 Malayan pangolin 232,433,120 10 
M8 Malayan pangolin 44,440,374 12 
M10 Malayan pangolin 227,801,882 0 
Z1 Chinese pangolin 444,573,526 0 


* 150-bp paired-end reads 
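
Because the sequencing depth differs by more than tenfold between samples, the raw mapped-read counts are easier to compare after normalization. The Python sketch below computes mapped reads per million total reads from the values in the table. This normalization is an illustration added here, not an analysis reported in the source.

```python
# Read counts transcribed from Extended Data Table 3: (total reads, mapped reads)
samples = {
    "M1": (107_267_359, 496),
    "M2": (38_091_846, 302),
    "M3": (79_477_358, 14),
    "M4": (32_829_850, 1_100),
    "M5": (547_302_862, 56),
    "M6": (232_433_120, 10),
    "M8": (44_440_374, 12),
    "M10": (227_801_882, 0),
    "Z1": (444_573_526, 0),
}


def reads_per_million(total_reads, mapped_reads):
    """Mapped reads normalized to one million total reads."""
    return mapped_reads / total_reads * 1_000_000


rpm = {name: reads_per_million(total, mapped)
       for name, (total, mapped) in samples.items()}

# Print samples from the highest to the lowest normalized count
for name, value in sorted(rpm.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:.3f} mapped reads per million")
```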
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a newly emerged
coronavirus that is responsible for the current pandemic of coronavirus disease 2019
(COVID-19), which has resulted in more than 3.7 million infections and 260,000 deaths
as of 6 May 2020. Vaccine and therapeutic discovery efforts are paramount to curb
the pandemic spread of this zoonotic virus. The SARS-CoV-2 spike (S) glycoprotein 
promotes entry into host cells and is the main target of neutralizing antibodies. Here 
we describe several monoclonal antibodies that target the S glycoprotein of 
SARS-CoV-2, which we identified from memory B cells of an individual who was 
infected with severe acute respiratory syndrome coronavirus (SARS-CoV) in 2003. 
One antibody (named S309) potently neutralizes SARS-CoV-2 and SARS-CoV 
pseudoviruses as well as authentic SARS-CoV-2, by engaging the receptor-binding 
domain of the S glycoprotein. Using cryo-electron microscopy and binding assays, we 
show that S309 recognizes an epitope containing a glycan that is conserved within the 
Sarbecovirus subgenus, without competing with receptor attachment. Antibody 
cocktails that include S309 in combination with other antibodies that we identified 
further enhanced SARS-CoV-2 neutralization, and may limit the emergence of 
neutralization-escape mutants. These results pave the way for using S309 and 
antibody cocktails containing S309 for prophylaxis in individuals at a high risk of 
exposure or as a post-exposure therapy to limit or treat severe disease. 


The entry of coronaviruses into host cells is mediated by the transmembrane
S glycoprotein, which forms homotrimers that protrude from the
viral surface. The S glycoprotein comprises two functional subunits:
S1 (divided into A, B, C and D domains), which is responsible for binding
to host-cell receptors; and S2, which promotes fusion of the viral
and cellular membranes. Both SARS-CoV-2 and SARS-CoV belong to
the Sarbecovirus subgenus and their S glycoproteins share 80% amino
acid sequence identity. SARS-CoV-2 S glycoprotein is closely related
to the bat SARS-related coronavirus RaTG13 S, with which it shares
97.2% amino acid sequence identity. It has recently been demonstrated
that, in humans, angiotensin-converting enzyme 2 (ACE2) is a functional
receptor for SARS-CoV-2, as is also the case for SARS-CoV.
Domain B of subunit S1 (S^B) is the receptor-binding domain (RBD) of
the S glycoprotein, and binds to ACE2 with high affinity, which possibly
contributed to the current rapid transmission of SARS-CoV-2 in
humans, as was previously proposed for SARS-CoV.

As the S glycoprotein of coronaviruses mediates entry into host cells,
it is the main target of neutralizing antibodies and the focus of efforts
to design therapeutic agents and vaccines. The S-glycoprotein trimers
are extensively decorated with N-linked glycans that are important for
protein folding and modulate accessibility to host proteases and
neutralizing antibodies. Previous cryo-electron microscopy (cryo-EM)
structures of the SARS-CoV-2 S glycoprotein in two distinct functional
states, along with cryo-EM and crystal structures of the SARS-CoV-2
S^B in complex with ACE2, have revealed dynamic states of the S^B
domains, providing a blueprint for the design of vaccines and inhibitors
of viral entry.

Passive administration of monoclonal antibodies (mAbs) could have
a major effect on controlling the SARS-CoV-2 pandemic by providing
immediate protection, complementing the development of prophylactic
vaccines. In a pandemic setting, the development of mAbs could be
accelerated to 5-6 months, compared to the traditional timeline
of 10-12 months. The recent finding that ansuvimab (mAb114) is a safe
and effective treatment for symptomatic infection with Ebola virus
is a notable example of the successful use of mAb therapy during an
outbreak of infectious disease. Potently neutralizing human mAbs
Table 1 | Characteristics of the antibodies described in this study

mAb VH (per cent identity) HCDR3 sequence VL (per cent identity) SARS-CoV SARS-CoV-2 Binding
S110 VH3-30 (96.88) AKDRFQFARSWYGDYFDY VK2-30 (96.60) + + RBD and non-RBD
S124 VH2-26 (98.28) ARINTAAYDYDSTTFDI VK1-39 (98.57) + + RBD
S109 VH3-23 (93.75) ARLESATQPLGYYFYGMDV VL3-25 (97.85) + - RBD
S111 VH3-30 (95.14) ARDIRHLIVVVSDMDV VK2-30 (98.30) + - RBD
S127 VH3-30 (96.53) AKDLFGYCRSTSCESLDD VK1-9 (98.92) + - RBD
S215 VH3-30 (90.28) ARETRHYSHGLNWFDP VK3-15 (98.92) + - RBD
S217 VH3-49 (95.58) SWIHRIVS VK1-33 (98.21) + - RBD
S218 VH3-30 (93.40) ARDVKGHIVVMTSLDY VK2-30 (97.62) + - RBD
S219 VH1-58 (92.01) AAEMATIQNYYYYYGMDV VK1-39 (95.34) + - RBD
S222 VH1-2 (91.67) ARGDVPVGTGWVFDF VK1-39 (92.47) + - RBD
S223 VH3-30 (95.14) ATVSVEGYTSGWYLGTLDF VK3-15 (98.21) + - RBD
S224 VH1-18 (90.97) ARQSHSTRGGWHFSP VK1-39 (95.70) + - RBD
S225 VH3-9 (96.18) AKDISLVFWSVNPPRNGMDV VK1-39 (98.57) + - RBD
S226 VH3-30 (89.61) ARDSSWQSTGWPINWFDR VK3-11 (96.11) + - RBD
S227 VH3-23 (95.14) ASPLRNYGDLLY VK1-5 (96.06) + - RBD
S228 VH3-30 (96.53) ARDLQMRVVVVSNFDY VK2D-30 (99.32) + - RBD
S230 VH3-30 (90.97) VTQRDNSRDYFPHYFHDMDV VK2-30 (97.62) + - RBD
S231 VH3-30 (90.62) ARDDNLDRHWPLRLGGY VK2-30 (94.56) + - RBD
S237 VH3-21 (96.53) ARGFERYYFDS VL1-44 (96.84) + - RBD
S309 VH1-18 (97.22) ARDYTRGAWFGESLIGGFDN VK3-20 (97.52) + + RBD
S315 VH3-7 (97.92) ARDLWWNDQAHYYGMDV VL3-25 (97.57) + + RBD
S303 VH3-23 (90.28) ARERDDIFPMGLNAFDI VK1-5 (97.49) + + RBD
S304 VH3-13 (97.89) ARGDSSGYYYYFDY VK1-39 (93.55) + + RBD
S306 VH1-18 (95.49) ASDYFDSSGYYHSFDY VK3-11 (98.92) + + Non-RBD
S310 VH1-69 (92.71) ATRTYDSSGYRPYYYGLDV VL2-23 (97.57) + + Non-RBD


VH and VL per cent identity refers to V gene segment identity compared to germline (as per the International Immunogenetics Information System (http://www.imgt.org/)). 


from the memory B cells of individuals infected with SARS-CoV or
Middle East respiratory syndrome coronavirus (MERS-CoV) have
previously been isolated. Passive transfer of these mAbs protected
mice challenged with various SARS-CoV isolates and SARS-related
coronaviruses, as well as with MERS-CoV. Structural characterization
of two of these mAbs in complex with the S glycoprotein of
SARS-CoV or MERS-CoV provided molecular-level information on the
mechanisms of viral neutralization. In particular, although both mAbs
blocked S^B attachment to the host receptor, the S230 mAb (which
neutralizes SARS-CoV) acted by functionally mimicking attachment to the
receptor and promoting fusogenic conformational rearrangements of
the S glycoprotein. Another mechanism of SARS-CoV neutralization
has recently been described for mAb CR3022, which bound a cryptic
epitope that is only accessible when at least two out of the three S^B
domains of an S-glycoprotein trimer were in the open conformation.
However, none of these mAbs neutralizes SARS-CoV-2. A mAb termed
47D11 that neutralizes SARS-CoV and SARS-CoV-2 was also recently
isolated from human-immunoglobulin transgenic mice, and several
mAbs have been isolated from individuals infected with SARS-CoV-2.


Identifying a SARS-CoV-2 cross-neutralizing mAb 


A set of human neutralizing mAbs (from an individual infected with
SARS-CoV in 2003) that potently inhibit both human and zoonotic
SARS-CoV isolates has previously been identified. To characterize
the potential cross-reactivity of these antibodies with SARS-CoV-2, we
performed a memory B cell screen using peripheral blood mononuclear
cells collected in 2013 from the same patient. Here we describe
19 mAbs from the initial screen (2004 blood draw) and 6 mAbs
from the new screen (2013 blood draw). The mAbs that we identified
had a broad usage of V gene segments, and were not clonally related
(Table 1). Eight out of the twenty-five mAbs bound to CHO cells that
express SARS-CoV-2 S glycoprotein or SARS-CoV S glycoprotein, with
half-maximal effective concentration values that ranged between 1.4
and 6,100 ng ml^-1, and 0.8 and 254 ng ml^-1, respectively (Fig. 1a, b).
We further evaluated the mAbs for binding to the SARS-CoV-2 and
SARS-CoV S^B domains, as well as to the prefusion-stabilized ectodomain
trimers of human coronavirus HCoV-OC43, MERS-CoV,
SARS-CoV and SARS-CoV-2 S glycoproteins. None of the mAbs that
we studied bound to prefusion ectodomain trimers of the HCoV-OC43
or MERS-CoV S glycoproteins, which indicated a lack of cross-reactivity
outside the Sarbecovirus subgenus (Extended Data Fig. 1). The mAbs
S303, S304, S309 and S315 bound SARS-CoV-2 and SARS-CoV RBDs with
nano- to sub-picomolar affinity (Extended Data Fig. 2). In particular,
the S309 IgG bound to the immobilized SARS-CoV-2 S^B domain and to
the ectodomain trimer of the S glycoprotein with sub-picomolar and
picomolar avidities, respectively (Fig. 1c). The S309 Fab bound with
nanomolar to sub-nanomolar affinities to both molecules (Fig. 1d).
S306 and S310 stained cells that express SARS-CoV-2 S glycoprotein
at higher levels than cells that express SARS-CoV S glycoprotein, yet
these mAbs did not interact with ectodomain trimers and RBD constructs
of SARS-CoV-2 or SARS-CoV S glycoprotein by enzyme-linked
immunosorbent assay. These results suggest that S306 and S310 may
recognize post-fusion SARS-CoV-2 S glycoprotein, which has recently
been proposed to be abundant on the surface of authentic SARS-CoV-2
viruses (Fig. 1a, b, Extended Data Fig. 3).

To evaluate the neutralization potency of the SARS-CoV-2
cross-reactive mAbs, we carried out pseudovirus neutralization
assays using a murine leukaemia virus (MLV) pseudotyping system.
S309 showed comparable neutralization potencies against both
Fig. 1 | Identification of a potent SARS-CoV-2 neutralizing mAb from an
individual infected with SARS-CoV. a, b, Binding of a panel of mAbs, isolated
from a patient immune to SARS-CoV, to SARS-CoV-2 (a) or SARS-CoV (b)
S glycoproteins expressed at the surface of expiCHO cells (symbols are means
of duplicates from one experiment). c, d, Avidity and affinity measurement of
S309 IgG1 (c) and Fab (d) for binding to immobilized SARS-CoV-2 S^B domain
(RBD) and to the prefusion ectodomain trimer of S glycoprotein, measured
using biolayer interferometry. e, Neutralization of SARS-CoV-2-MLV,
SARS-CoV-MLV (bearing S glycoprotein from various isolates) and the
SARS-related coronavirus WIV-1 by mAb S309. Mean ± s.d. of triplicates is
shown for all pseudoviruses, except for SARS-CoV-2 (mean of duplicates).
f, Neutralization of authentic SARS-CoV-2 (strain n-CoV/USA_WA1/2020) by
mAbs as measured by a focus-forming assay on Vero E6 cells. For the cocktail of
S309 and S304, the concentration of S309 is as indicated in the x axis. S304 was
added at a constant amount of 20 µg ml^-1. Mean ± s.d. of quadruplicates is
shown. In a, b, all mAbs in the same experiment were tested once. Individual
mAbs were tested independently with similar results. In c-f, one representative
out of two experiments with similar results is shown.


SARS-CoV and SARS-CoV-2 pseudoviruses, whereas S303 neutralized
SARS-CoV-MLV but not SARS-CoV-2-MLV. S304 and S315 weakly
neutralized SARS-CoV-MLV and SARS-CoV-2-MLV (Extended Data Fig. 4).
Fig. 2 | Cryo-EM structures of the SARS-CoV-2 S glycoprotein in complex
with the S309 neutralizing mAb Fab fragment. a, Ribbon diagram of the
partially open SARS-CoV-2 S-glycoprotein trimer (one S^B domain is open)
bound to three S309 Fabs. The Fab bound to the open B domain is included only
for visualization, and was omitted from the final model. b, c, Ribbon diagrams
in two orthogonal orientations of the closed SARS-CoV-2 S-glycoprotein trimer
bound to three S309 Fabs. d, Close-up view of the S309 epitope, showing the
contacts formed with the core fucose (labelled with a star) and the rest of the
glycan at position N343. e, Close-up view of the S309 epitope, showing the
20-residue-long CDRH3 sitting atop the S^B helix that comprises residues
337-344. The oligosaccharide at position N343 is omitted for clarity. In a-e,
only the Fab variable domains are shown. In d, e, selected residues involved
in interactions between S309 and SARS-CoV-2 S glycoprotein are shown.
f, Molecular surface representation of the SARS-CoV-2 S-glycoprotein trimer,
showing the S309 epitope on one protomer coloured by residue conservation
among SARS-CoV-2 and SARS-CoV S glycoproteins. The other two protomers
are coloured pink and gold.


In addition, S309 neutralized SARS-CoV-MLVs from isolates of the
3 phases of the 2002-2003 epidemic with half-maximal inhibitory
concentration (IC50) values of between 120 and 180 ng ml^-1, and
partially neutralized the SARS-related coronavirus WIV-1 (Fig. 1e). Finally,
mAb S309 potently neutralized authentic SARS-CoV-2
(2019n-CoV/USA_WA1/2020) with an IC50 of 79 ng ml^-1 (Fig. 1f).
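
To make the meaning of the reported IC50 concrete, the sketch below evaluates a simple Hill-type dose-response curve around the value reported for authentic SARS-CoV-2 (79 ng/ml). The functional form and the Hill slope of 1 are assumptions made for illustration; the text does not state which curve-fitting model was used, and the function name is hypothetical.

```python
def fraction_neutralized(conc_ng_ml, ic50_ng_ml, hill_slope=1.0):
    """Fraction of infection inhibited at a given antibody concentration.

    Simple Hill equation; hill_slope=1.0 is an illustrative assumption.
    """
    if conc_ng_ml <= 0:
        return 0.0
    return 1.0 / (1.0 + (ic50_ng_ml / conc_ng_ml) ** hill_slope)


IC50_AUTHENTIC_VIRUS = 79.0  # ng/ml, value reported in the text

# One tenth of, equal to, and ten times the IC50
for conc in (7.9, 79.0, 790.0):
    inhibition = fraction_neutralized(conc, IC50_AUTHENTIC_VIRUS)
    print(f"{conc:7.1f} ng/ml -> {inhibition:.1%} neutralization")
```

By definition the curve passes through 50% inhibition at the IC50; with a slope of 1, a tenfold higher concentration gives roughly 91% inhibition under this model.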


Structural basis of S309 cross-neutralization


To study the mechanisms of S309-mediated neutralization, we characterized
the complex between the S309 Fab fragment and a prefusion-stabilized
ectodomain trimer of SARS-CoV-2 S glycoprotein using
single-particle cryo-EM. Similar to a previous study of apo SARS-CoV-2
S glycoprotein, 3D classification of the cryo-EM data enabled identification
of two structural states: a trimer with one S^B domain open, and
a closed trimer. We determined 3D reconstructions at 3.7 Å and 3.1 Å
resolution, respectively, of the ectodomain trimer of the SARS-CoV-2
S glycoprotein with a single open S^B domain and in a closed state (applying
three-fold symmetry), both with three S309 Fabs bound (Fig. 2a-c,
Extended Data Fig. 5a-f). In parallel, we also determined a crystal
structure of the S309 Fab at 3.3 Å resolution to assist model building
(Extended Data Fig. 5g). The S309 Fab bound to the open S^B domain is
weakly resolved in the cryo-EM map, owing to marked conformational
variability of the upward pointing S^B domain, and was not modelled
in density. The analysis below is based on the closed-state structure.
S309 recognizes a proteoglycan epitope on the SARS-CoV-2 S^B,
distinct from the receptor-binding motif. The epitope is accessible in both
Fig. 3 | Mechanism of S309 neutralization. a, Ribbon diagrams of S309 and
ACE2 bound to SARS-CoV-2 S^B. This composite model was generated using the
SARS-CoV-2 S glycoprotein-S309 cryo-EM structure reported here, and a
previously published crystal structure of SARS-CoV-2 S^B bound to ACE2. Only
the Fab variable domains are shown. b, Competition of S309 or S230 mAbs with
ACE2 to bind to SARS-CoV S^B (left) and SARS-CoV-2 S^B (right). The vertical
dashed line indicates the start of the association of mAb-complexed or free S^B
to solid-phase ACE2. Each mAb was tested in at least two experiments with
similar results. c, Neutralization of SARS-CoV-MLV by S309 recombinant (r)
IgG1 or S309 Fab, plotted in nM. Mean of duplicates is shown; the experiment
was repeated with similar results. d, mAb-mediated ADCC using primary
natural killer effector cells and SARS-CoV-2 S-glycoprotein-expressing
expiCHO as target cells. Bars show the average area under the curve (AUC) for
the responses of 3 (VV) (left) or 4 (FF-FV) (right) donors genotyped for their
FcγRIIIa. Mean ± s.d., data were pooled from two independent experiments.
e, Activation of high-affinity (V158) (left) or low-affinity (F158) (right) FcγRIIIa
was measured using Jurkat reporter cells and SARS-CoV-2 S-glycoprotein-expressing
expiCHO as target cells. RLU, relative luminescence unit.
f, mAb-mediated ADCP using Cell-Trace-Violet-labelled peripheral blood
mononuclear cells as phagocytic cells, and PKH67-labelled SARS-CoV-2
S-glycoprotein-expressing expiCHO as target cells. Bars show the AUC for the
responses of four donors. Mean ± s.d., data were pooled from two independent
experiments. g, Activation of FcγRIIa (131H) measured using Jurkat reporter
cells and SARS-CoV-2 S-glycoprotein-expressing expiCHO as target cells. In e
and g, one experiment, symbols show means of duplicates per mAb dilution
except for S304, S230 and S315 (one data point per dilution). S309 was retested
in an independent experiment with similar results.


the open and closed states of the S glycoprotein, which explains the
stoichiometric binding of Fab to the trimer of the S glycoprotein (Fig. 2a-c).
The S309 paratope is composed of all 6 complementarity-determining
region (CDR) loops, which bury a surface area of about 1,150 Å^2 at the
interface with S^B through electrostatic interactions and hydrophobic
contacts. The 20-residue-long CDRH3 sits atop the S^B helix that comprises
residues 337-344, and also contacts the edge of the S^B 5-stranded
β-sheet (residues 356-361), overall accounting for about 50% of the
buried surface area (Fig. 2d, e). CDRL1 and CDRL2 extend the epitope by
interacting with the helix that spans residues 440-444, which is located
near the three-fold molecular axis of the S glycoprotein. CDRH3 and
CDRL2 sandwich the glycan of the SARS-CoV-2 S glycoprotein at position
N343, through contacts with the core fucose moiety (consistent
with a previous study that detected SARS-CoV-2 N343 core-fucosylated
peptides by mass spectrometry) and to a lesser extent with the other
saccharides within the glycan chain (Fig. 2d). These interactions
between S309 and the glycan bury an average surface of about 300 Å^2
and stabilize the N343 oligosaccharide, which is resolved to a much
larger extent than in structures of the apo SARS-CoV-2 S glycoprotein.

The structural data explain the S309 cross-reactivity between
SARS-CoV-2 and SARS-CoV, as 17 out of 22 residues of the epitope
are strictly conserved (Fig. 2f, Extended Data Fig. 6a, b). R346, N354,
R357 and L441 of SARS-CoV-2 are conservatively substituted for K333,
E341, K344 (except for SARS-CoV isolate GZ02, in which this is R444)
and I428, respectively, of SARS-CoV, and the K444 of SARS-CoV-2 is
semi-conservatively substituted for T431 of SARS-CoV, in agreement
with the comparable binding affinities of S309 to SARS-CoV and
SARS-CoV-2 S glycoprotein (Fig. 1c). The oligosaccharide at position
N343 is also conserved in both viruses and corresponds to SARS-CoV
N330, for which core-fucosylated glycopeptides were previously
detected by mass spectrometry, which would allow for similar interactions
with the S309 Fab. Analysis of the S glycoprotein sequences of the
11,839 SARS-CoV-2 isolates reported to date indicates that the epitope
residues are conserved in all but 4 isolates, for which we found the N354D
or the S359N substitutions that are not expected to affect recognition
by S309 (Extended Data Fig. 7a, b). Furthermore, S309 contact residues
are highly conserved across human and animal isolates of clade 1, 2
and 3 sarbecoviruses (Extended Data Fig. 7c). Collectively, our data
suggest that S309 could neutralize potentially all SARS-CoV-2 isolates
known to be circulating to date, and possibly many other zoonotic
sarbecoviruses. The degree of conservation is consistent with the moderate
rates of evolution of SARS-CoV-2, estimated at about 1.8 × 10^-3
substitutions per site per year*’. On the basis of more than 10° viral 
sequences analysed to date, an estimated 112 residues are under positive 
selection (8 in the S glycoprotein) and 18 are under negative selection 
(1 in the S glycoprotein) in a genome of nearly 30 kb. These observations
are consistent with the fact that Coronaviridae is a family of
RNA viruses with unusually high replication fidelity required by their
exceptionally large genomes.
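
The conservation figures in this paragraph can be put on a common scale with a few lines of arithmetic. The Python sketch below takes the evolutionary rate as 1.8 × 10^-3 substitutions per site per year and rounds the genome length to 30,000 nucleotides; both are treated here as approximate inputs, and the variable names are hypothetical.

```python
# Approximate inputs taken from the text (rounded; see the lead-in for caveats)
SUBSTITUTIONS_PER_SITE_PER_YEAR = 1.8e-3
GENOME_LENGTH_NT = 30_000           # "nearly 30 kb"

EPITOPE_RESIDUES = 22
STRICTLY_CONSERVED_RESIDUES = 17    # conserved between SARS-CoV-2 and SARS-CoV

ISOLATES_ANALYSED = 11_839
ISOLATES_WITH_EPITOPE_CHANGE = 4    # N354D or S359N

# Expected substitutions accumulated across the whole genome per year
substitutions_per_genome_per_year = (
    SUBSTITUTIONS_PER_SITE_PER_YEAR * GENOME_LENGTH_NT
)

# Share of epitope residues that are identical in the two viruses
conserved_fraction = STRICTLY_CONSERVED_RESIDUES / EPITOPE_RESIDUES

# Share of sequenced isolates carrying a substitution in the epitope
variant_fraction = ISOLATES_WITH_EPITOPE_CHANGE / ISOLATES_ANALYSED

print(f"~{substitutions_per_genome_per_year:.0f} substitutions per genome per year")
print(f"{conserved_fraction:.1%} of epitope residues strictly conserved")
print(f"{variant_fraction:.3%} of isolates with an epitope substitution")
```

Under these inputs the genome accumulates on the order of 50 substitutions per year, while only about 0.03% of the sequenced isolates carry a change in the 22-residue epitope.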


Mechanism of S309-mediated neutralization

The cryo-EM structure of S309 bound to SARS-CoV-2 S glycoprotein
presented here, combined with the structures of SARS-CoV-2 S^B and
SARS-CoV S^B in complex with ACE2, indicates that the Fab engages
an epitope distinct from the receptor-binding motif and would not clash
with ACE2 upon binding to S glycoprotein (Fig. 3a). Biolayer interferometry
analysis of S309 Fab or IgG binding to the SARS-CoV-2 S^B domain
or the ectodomain trimer of S glycoprotein confirmed the absence of
competition between S309 and ACE2 for binding to the SARS-CoV-2
S glycoprotein (Fig. 3b, Extended Data Fig. 8).

To further investigate the mechanism of S309-mediated neutralization,
we compared side-by-side infection of SARS-CoV-2-MLV in the
presence of either S309 Fab or S309 IgG. Both experiments yielded
comparable IC50 values (3.8 and 3.5 nM, respectively), indicating similar
potencies for IgG and Fab (Fig. 3c). However, S309 IgG-mediated
neutralization reached 100%, whereas neutralization plateaued at about
80% in the presence of S309 Fab (Fig. 3c). This result indicates that one
Fig. 4 | mAb cocktails enhance SARS-CoV-2 neutralization. a, Heat map
showing the competition of mAb pairs for binding to the SARS-CoV S^B domain
as measured by biolayer interferometry (as shown in Extended Data Fig. 10).
b, Competition of mAb pairs for binding to the SARS-CoV-2 S^B domain. Each
competition measured once. In a, b, a coloured box indicates competition and


or more IgG-specific bivalent mechanisms (such as S-glycoprotein
trimer cross-linking, steric hindrance or aggregation of virions) may
contribute to the ability of S309 to fully neutralize pseudovirions.

Fc-dependent effector mechanisms, such as antibody-dependent
cell cytotoxicity (ADCC) mediated by natural killer cells, can
contribute to viral control in individuals infected with virus. We
observed efficient S309- and S306-mediated ADCC of SARS-CoV-2
S-glycoprotein-transfected cells, whereas the other mAbs that we tested
showed limited or no activity (Fig. 3d, Extended Data Fig. 9a). These
findings might be related to distinct binding orientations and/or
positioning of the mAb Fc fragment relative to the FcγRIIIa receptors. ADCC
was observed only using natural killer (effector) cells that express the
high-affinity FcγRIIIa variant (V158) but not the low-affinity variant
(F158) (Fig. 3d). These results, which we confirmed using a FcγRIIIa cell
reporter assay (Fig. 3e), suggest that S309 Fc engineering could
potentially enhance the activation of natural killer cells with the low-affinity
FcγRIIIa variant (F158). Antibody-dependent cellular phagocytosis
(ADCP), mediated by macrophages or dendritic cells, can contribute
to viral control by clearing virus and infected cells and by stimulating
a T cell response via presentation of viral antigens. Similar to the
ADCC results, the mAbs S309 and S306 showed the strongest ADCP
response (Fig. 3f, Extended Data Fig. 9b). However, FcγRIIa signalling
was observed only for S309 (Fig. 3g). These findings suggest that ADCP
by monocytes was dependent on engagement of both FcγRIIIa and
FcγRIIa. Collectively, these results demonstrate that, in addition to
potent in vitro neutralization, S309 may leverage additional protective
mechanisms in vivo, as has previously been shown for other antiviral
antibodies. Although the risks of antibody-dependent disease
enhancement will need to be evaluated for SARS-CoV-2, potent virus
neutralization by a specific monoclonal antibody or by antibody cocktails
is expected to limit this possibility, compared to weakly neutralizing
antibodies that might potentially be induced upon vaccination
or infection. Furthermore, neutralizing antibodies are associated
with reduced susceptibility to re-infection or disease in humans.
Regardless, the possibility of antibody-dependent disease enhancement
will need to be assessed during clinical trials for either antibodies
or vaccines.


Enhancing SARS-CoV-2 neutralization 


To gain more insight into the epitopes recognized by our panel of
mAbs, we used structural information, escape mutant analysis and
each colour corresponds to an antigenic site. c, d, Neutralization of
SARS-CoV-2-MLV by S309 combined with an equimolar amount of S304 or S315
mAb. For mAb cocktails, the concentration on the x axis is that of the total
amount of mAb per dilution. Mean of duplicates is shown. Both experiments
were repeated with similar results.


biolayer-interferometry-based epitope binning to map the antigenic
sites that are present on the SARS-CoV and SARS-CoV-2 S^B domains
(Fig. 4a, Extended Data Fig. 10). This analysis identified at least four
antigenic sites within the S^B domain of SARS-CoV targeted by our panel
of mAbs. The receptor-binding motif, which is targeted by S230, S227
and S110, is termed site I. Sites II and III are defined by S315 and S124,
respectively, and the two sites were bridged by mAb S304. Site IV is
defined by the S309, S109 and S303 mAbs. Given the lower number
of mAbs that cross-react with SARS-CoV-2, we were able to identify
site IV targeted by S309 and S303, and sites II and III targeted by S304
and S315 (Fig. 4b).

On the basis of these findings, we evaluated the neutralization
potency of the site-IV S309 mAb in combination with either the
site-II S315 mAb or the site-II and site-III S304 mAb. Although S304
and S315 alone were weakly neutralizing, the combination of either
of these mAbs with S309 resulted in an enhanced neutralization
potency, compared to single mAbs, against both SARS-CoV-2-MLV
and authentic SARS-CoV-2 (Figs. 1f, 4c, d). A synergistic effect between
two non-competing anti-RBD mAbs has previously been reported for
SARS-CoV and our data extend this observation to SARS-CoV-2,
providing a proof-of-concept for the use of mAb combinations to prevent
or control SARS-CoV-2.

In summary, our study identifies S309 as a human mAb that has
broad neutralizing activity against multiple sarbecoviruses (including
SARS-CoV-2), via recognition of a highly conserved epitope in the
S^B domain that comprises the N343 glycan (N330 in SARS-CoV S
glycoprotein). Furthermore, S309 can recruit effector mechanisms and
showed increased neutralization in combination with weakly neutraliz- 
ing mAbs, which may mitigate the risk of viral escape. Our data indicate 
the potential to discover potently neutralizing pan-sarbecovirus mAbs, 
define antigenic sites to include in vaccine design and pave the way 
to support preparedness for future outbreaks of sarbecoviruses. As 
S309 shows promise as an effective countermeasure to the COVID-19 
pandemic caused by SARS-CoV-2, Fc variants of S309 with increased 
half-life and effector functions have entered an accelerated develop- 
ment path towards clinical trials. 
Methods


No statistical methods were used to predetermine sample size. The 
experiments were not randomized and investigators were not blinded 
to allocation during experiments and outcome assessment. 


Cell lines

Cell lines used in this study were obtained from ATCC (HEK293T, 
Vero-E6, Vero CCL81 cells) or Invitrogen (Expi-CHO cells). All cell lines 
used in this study were routinely tested for mycoplasma and found to 
be mycoplasma-free. 


Ethics statement 

Donors provided written informed consent for the use of blood and 
blood components (such as sera), following approval by the Canton 
Ticino Ethics Committee (Switzerland). 


Antibody discovery and expression 

mAbs were isolated from Epstein-Barr-virus-immortalized memory B
cells. Recombinant antibodies were expressed in expiCHO cells
transiently cotransfected with plasmids expressing the heavy and light
chain, as previously described. The mAbs S303, S304, S306, S309,
S310 and S315 were expressed as recombinant IgG-LS antibodies. The
LS mutation confers a longer half-life in vivo. Antibodies S110 and S124
tested in Fig. 1 and Extended Data Fig. 1 were purified mAbs produced
from immortalized B cells.


Transient expression of recombinant SARS-CoV-2 protein and 
flow cytometry 

The full-length S gene of SARS-CoV-2 strain (SARS-CoV-2-S) isolate 
BetaCoV/Wuhan-Hu-1/2019 (accession number MN908947) was 
codon-optimized for human cell expression and cloned into the 
phCMV1 expression vector (Genlantis). Expi-CHO cells were transiently 
transfected with phCMV1-SARS-CoV-2-S, SARS-spike_pcDNA-3.1 (strain 
SARS) or empty phCMV1 (mock) using Expifectamine CHO Enhancer. 
Two days after transfection, cells were collected for immunostaining 
with mAbs. An Alexa-647-labelled secondary antibody anti-human IgG 
Fc was used for detection. Binding of mAbs to transfected cells was analysed
by flow cytometry using a ZE5 Cell Analyzer (Bio-Rad) and FlowJo
software (TreeStar). Positive binding was defined by differential stain- 
ing of CoV-S-glycoprotein transfectants versus mock transfectants. 


Affinity and avidity determination and competition 
experiments using Octet (biolayer interferometry) 

For affinity and avidity determination of IgG1 compared to Fab fragment,
biotinylated RBD of SARS-CoV-2 (produced in house; residues 331-550
of S glycoprotein from BetaCoV/Wuhan-Hu-1/2019, accession number
MN908947, biotinylated with EZ-Link NHS-PEG4-biotin from
ThermoFisher) and biotinylated SARS-CoV-2 2P S glycoprotein avi-tagged
were loaded at 7.5 µg/ml in kinetics buffer (0.01% endotoxin-free BSA,
0.002% Tween-20, 0.005% NaN3 in PBS) for 8 min onto streptavidin
biosensors (Molecular Devices, ForteBio). Association of IgG1 and Fab
was performed in kinetics buffer at 100, 33, 11, 3.6 and 1.2 nM for 5 min.
Dissociation in kinetics buffer was measured for 10 min. K_D values were
calculated using a 1:1 global fit model (Octet).
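
For readers unfamiliar with the 1:1 binding model named above, the sketch below simulates the association and dissociation phases with the timings and analyte concentrations given in the text. It is a minimal illustration of the standard 1:1 (Langmuir) model, not the Octet software's fitting routine; the rate constants `KON` and `KOFF` are hypothetical values chosen only to produce a plausible curve.

```python
import math

KON = 1.0e5    # association rate constant, 1/(M*s); hypothetical value
KOFF = 1.0e-4  # dissociation rate constant, 1/s; hypothetical value
K_D = KOFF / KON  # equilibrium dissociation constant, in M


def association(t_s, conc_molar, kon=KON, koff=KOFF, r_max=1.0):
    """Response after t_s seconds of association in a 1:1 binding model."""
    k_obs = kon * conc_molar + koff
    r_eq = r_max * conc_molar / (conc_molar + koff / kon)
    return r_eq * (1.0 - math.exp(-k_obs * t_s))


def dissociation(t_s, r_start, koff=KOFF):
    """Response t_s seconds after moving the sensor to buffer."""
    return r_start * math.exp(-koff * t_s)


ASSOCIATION_S = 5 * 60     # 5 min association, as in the text
DISSOCIATION_S = 10 * 60   # 10 min dissociation, as in the text
concentrations_nM = [100, 33, 11, 3.6, 1.2]

for conc in concentrations_nM:
    r_end = association(ASSOCIATION_S, conc * 1e-9)
    r_final = dissociation(DISSOCIATION_S, r_end)
    print(f"{conc:>5} nM: end of association {r_end:.3f}, "
          f"end of dissociation {r_final:.3f}")
```

A global fit estimates one shared pair of rate constants from all concentrations at once, and K_D follows as their ratio.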

Alternatively, the apparent K_D for IgGs was determined
using protein A biosensors (Pall ForteBio) that were loaded with
different mAbs at 2.7 µg/ml for 1 min, after a hydration step for 10 min in
kinetics buffer. Association curves were recorded for 5 min by incubating
the mAb-coated sensors with different concentrations of SARS-CoV
RBD (Sino Biological) or SARS-CoV-2 RBD (produced in house; residues
331-550 of S glycoprotein from BetaCoV/Wuhan-Hu-1/2019, accession
number MN908947). The highest RBD concentration was 10 µg/ml,
then serially diluted 1:2.5. Dissociation was recorded for 9 min by
moving the sensors to wells containing kinetics buffer. K_D values were
calculated using a global fit model (Octet). Octet Red96 (ForteBio)
equipment was used.
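
The 1:2.5 serial dilution described above can be written out explicitly. In the sketch below the top concentration of 10 comes from the text and is in the same units as the text; the number of dilution points is not stated in the source, so `n_points=6` is a hypothetical choice for illustration.

```python
def dilution_series(top_concentration, dilution_factor, n_points):
    """Concentrations obtained by repeatedly diluting 1:dilution_factor."""
    return [top_concentration / dilution_factor ** step
            for step in range(n_points)]


# Top concentration from the text; n_points is an assumed value
series = dilution_series(top_concentration=10.0, dilution_factor=2.5, n_points=6)

for step, concentration in enumerate(series):
    print(f"dilution step {step}: {concentration:.4f}")
```

Each step divides the previous concentration by 2.5, so the first three points are 10, 4 and 1.6.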

For mAb competition experiments, His-tagged RBD of SARS-CoV
or SARS-CoV-2 was loaded for 5 min at 3 µg/ml in kinetics buffer onto
anti-Penta-HIS (HIS1K) biosensors (Molecular Devices, ForteBio).
Association of mAbs was performed in kinetics buffer at 15 µg/ml.

For ACE2 competition experiments, ACE2-His (Bio-Techne AG) was
loaded for 30 min at 5 µg/ml in kinetics buffer onto anti-HIS (HIS2)
biosensors (Molecular Devices, ForteBio). SARS-CoV RBD-rabbit Fc or
SARS-CoV-2 RBD-mouse Fc (Sino Biological Europe GmbH) at 1 µg/ml was
associated for 15 min, after preincubation with or without antibody
(30 µg/ml, 30 min). Dissociation was monitored for 5 min.


Enzyme-linked immunosorbent assay 

The following proteins were coated on 96-well enzyme-linked
immunosorbent assay plates at the following concentrations: SARS-CoV RBD
(Sino Biological, 40150-V08B1) at 1 µg/ml, SARS-CoV-2 RBD (produced
in house) at 10 µg/ml, ectodomains (stabilized prefusion trimer) of
SARS-CoV, SARS-CoV-2, HCoV-OC43 and MERS-CoV, all at 1 µg/ml. After
blocking with 1% BSA in PBS, antibodies were added to the plates at
concentrations between 5 and 0.000028 µg/ml and incubated for 1 h
at room temperature. Plates were washed and secondary antibody Goat
Anti Human IgG-AP (Southern Biotechnology: 2040-04) was added.
Substrate p-nitrophenyl phosphate (pNPP) (Sigma-Aldrich 71768)
was used for colour development. Optical density at 405 nm was read
on an ELx808IU plate reader (Biotek).
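
As a worked check on the antibody titration range, the sketch below generates a serial dilution series. The dilution factor and number of points are assumptions: the text states only the endpoints (5 and 0.000028 µg/ml), and a threefold series over 12 points happens to reproduce them. The function name is hypothetical.

```python
def dilution_series(top_conc, factor, n_points):
    """Concentrations of a serial dilution, highest first."""
    return [top_conc / factor ** i for i in range(n_points)]


# Assumed layout: threefold steps, 12 points, starting at 5 ug/ml.
SERIES_UG_PER_ML = dilution_series(5.0, 3.0, 12)

# The lowest point is 5 / 3**11, which rounds to the stated 0.000028 ug/ml.
LOWEST = SERIES_UG_PER_ML[-1]
```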


Measurement of Fc-mediated effector functions 

ADCC assays were performed using expiCHO cells transiently
transfected with SARS-CoV or SARS-CoV-2 S glycoprotein as targets. Natural
killer cells were isolated from fresh blood of healthy donors using the
MACSxpress NK Isolation Kit (Miltenyi Biotec, cat. no. 130-098-185).
Target cells were incubated with titrated concentrations of mAbs for
10 min and were then incubated with primary human natural killer cells
as effector cells at an effector:target ratio of 9:1. ADCC was measured
using an LDH release assay (Cytotoxicity Detection Kit (LDH) (Roche; cat.
no. 11644793001)) after 4 h incubation at 37 °C.

ADCP assays were performed using expiCHO target cells transiently 
transfected with SARS-CoV-2 S glycoprotein and fluorescently labelled
with PKH67 Fluorescent Cell Linker Kits (Sigma Aldrich, cat. no. MINI67). 
Target cells were incubated with titrated concentrations of mAbs for 10 
min, followed by incubation with human peripheral blood mononuclear 
cells isolated from healthy donors and fluorescently labelled with Cell 
Trace Violet (Invitrogen, cat. no. C34557). An effector:target ratio of 
20:1 was used. After an overnight incubation at 37 °C, cells were stained 
with anti-human CD14-APC antibody (BD Pharmingen, cat. no. 561708, 
Clone M5E2) to stain monocytes. Antibody-mediated phagocytosis 
was determined by flow cytometry, gating on CD14+ cells that were
double-positive for cell trace violet and PKH67. 

Determination of mAb-dependent activation of human FcγRIIIa or
FcγRIIa was performed using expiCHO cells transiently transfected with
SARS-CoV-2 S glycoprotein (BetaCoV/Wuhan-Hu-1/2019). Cells were
incubated with titrated concentrations of mAbs for 10 min before incubation
with FcγRIIIa-receptor- or FcγRIIa-expressing Jurkat cells stably transfected
with NFAT-driven luciferase gene (Promega, cat. no. G9798 and G7018).
An effector-to-target ratio of 6:1 for FcγRIIIa and 5:1 for FcγRIIa was used.
Activation of human FcγR in this bioassay results in the NFAT-mediated
expression of a luciferase reporter gene. Luminescence was measured after
21 h of incubation at 37 °C with 5% CO2 using the Bio-Glo Luciferase
Assay Reagent according to the manufacturer's instructions.


Pseudovirus neutralization assays 
MLV-based SARS-CoV S-glycoprotein-pseudotyped viruses
were prepared as previously described. HEK293T cells were
cotransfected with a SARS-CoV, SARS-CoV-2, CUHK, GZ02 or WiV1
S-glycoprotein-encoding plasmid, an MLV Gag-Pol packaging construct
and the MLV transfer vector encoding a luciferase reporter using the
Lipofectamine 2000 transfection reagent (Life Technologies)
according to the manufacturer’s instructions. Cells were incubated for 5 h at
37 °C with 8% CO2 with OPTIMEM transfection medium. DMEM
containing 10% FBS was added for 72 h.

VeroE6 cells or DBT cells transfected with human ACE2 were
cultured in DMEM containing 10% FBS, 1% penicillin-streptomycin and
plated into 96-well plates for 16-24 h. Concentrated pseudovirus with
or without serial dilution of antibodies was incubated for 1 h and then
added to the wells after washing 3× with DMEM. After 2-3 h DMEM
containing 20% FBS and 2% penicillin-streptomycin was added to the
cells for 48 h. Following 48 h of infection, One-Glo-EX (Promega) was
added to the cells and incubated in the dark for 5-10 min before
reading on a Varioskan LUX plate reader (ThermoFisher). Measurements
were done in duplicate and relative luciferase units were converted to
per cent neutralization and plotted with a nonlinear regression curve
fit in PRISM.
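
The conversion from relative luciferase units (RLU) to per cent neutralization can be sketched as below. The exact normalization is not given in the text, so this assumes the common convention of scaling each well against a virus-only control (0% neutralization) and an optional cells-only background (100% neutralization); all function and parameter names are hypothetical.

```python
def mean_of_duplicates(rlu_a, rlu_b):
    """Average the two replicate wells measured for each condition."""
    return (rlu_a + rlu_b) / 2.0


def percent_neutralization(rlu_sample, rlu_virus_only, rlu_cells_only=0.0):
    """Per cent neutralization relative to the virus-only control.

    Assumed convention: virus-only wells define 0% neutralization and
    cells-only wells (background) define 100% neutralization.
    """
    window = rlu_virus_only - rlu_cells_only
    if window <= 0:
        raise ValueError("virus-only signal must exceed background")
    return 100.0 * (1.0 - (rlu_sample - rlu_cells_only) / window)
```

The resulting percentages, plotted against antibody concentration, are what a nonlinear regression would then be fitted to.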


Live virus neutralization assay 

SARS-CoV-2 strain 2019-nCoV/USA_WA1/2020 was obtained from the
Centers for Disease Control and Prevention (gift of N. Thornburg).
Virus was passaged once in Vero CCL81 cells (ATCC) and titrated by
focus-forming assay on Vero E6 cells. Serial dilutions of the indicated
mAbs were incubated with 10^2 focus-forming units of SARS-CoV-2 for
1 h at 37 °C. mAb-virus complexes were added to Vero E6 cell monolayers
in 96-well plates and incubated at 37 °C for 1 h. Subsequently, cells were
overlaid with 1% (w/v) methylcellulose in MEM supplemented with 2%
FBS. Plates were collected 30 h later by removing overlays and fixed with
4% PFA in PBS for 20 min at room temperature. Plates were washed and
sequentially incubated with 1 µg/ml of CR3022 anti-S-glycoprotein
antibody and HRP-conjugated goat anti-human IgG in PBS
supplemented with 0.1% saponin and 0.1% BSA. SARS-CoV-2-infected cell foci
were visualized using TrueBlue peroxidase substrate (KPL) and
quantified on an ImmunoSpot microanalyzer (Cellular Technologies). Data
were processed using Prism software (GraphPad Prism 8.0).


Recombinant S-glycoprotein ectodomain and S° production 
The SARS-CoV-2 2P S glycoprotein (GenBank: YP_009724390.1)
ectodomain was produced in 500-ml cultures of HEK293F cells grown in
suspension using FreeStyle 293 expression medium (Life Technologies)
at 37 °C in a humidified 8% CO2 incubator rotating at 130 r.p.m.,
as previously reported. The culture was transfected using 293fectin
(ThermoFisher Scientific) with cells grown to a density of 10^6 cells per
ml and cultivated for 3 d. The supernatant was collected and cells were
resuspended for another three days, yielding two collections.
Clarified supernatants were purified using a 5-ml Cobalt affinity column
(Takara). Purified protein was concentrated and flash-frozen in a buffer
containing 20 mM Tris pH 8.0 and 150 mM NaCl before cryo-EM
analysis. The SARS-CoV-2 2P S-glycoprotein-avi, SARS-CoV S glycoprotein,
HCoV-OC43 S glycoprotein and MERS-CoV S glycoprotein constructs
have previously been described and were produced similarly to
SARS-CoV-2 2P S glycoprotein.


Cryo-EM sample preparation and data collection 

Three microlitres of SARS-CoV-2 S glycoprotein at 1.6 mg/ml was mixed
with 0.45 µl of S309 Fab (obtained by LysC fragmentation of S309 IgG)
at 7.4 mg/ml for 1 min at room temperature before application onto a
freshly glow-discharged 1.2/1.3 UltraFoil grid (300 mesh). Plunge
freezing used a Vitrobot Mark IV (ThermoFisher Scientific) with a blot force
of 0 and 6.5 s blot time at 100% humidity and 25 °C. Data were acquired
using the Leginon software to control an FEI Titan Krios transmission
electron microscope operated at 300 kV and equipped with a Gatan K2
Summit direct detector and Gatan Quantum GIF energy filter, operated
in zero-loss mode with a slit width of 20 eV. Automated data collection
was carried out using Leginon at a nominal magnification of 130,000x
with a pixel size of 0.525 Å with tilt angles ranging between 20° and 50°,
as previously described. The dose rate was adjusted to 8 counts per
pixel per s, and each movie was acquired in super-resolution mode
fractionated in 50 frames of 200 ms. Three thousand nine hundred
micrographs were collected in a single session with a defocus range
of between -1.0 and -3.0 µm.
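
The exposure implied by these acquisition settings can be worked out as in the sketch below. The frame count, frame time, dose rate and super-resolution pixel size are taken from the text. The conversion to a dose per square ångström rests on two assumptions that the text does not state: that the dose rate refers to the physical detector pixel (twice the super-resolution pixel size), and that one count corresponds to roughly one electron. The function name is hypothetical.

```python
FRAMES = 50
FRAME_TIME_S = 0.200
DOSE_RATE_COUNTS_PER_PIXEL_PER_S = 8.0
SUPER_RES_PIXEL_A = 0.525


def exposure_summary():
    """Total exposure per movie derived from the stated settings."""
    exposure_s = FRAMES * FRAME_TIME_S
    counts_per_pixel = DOSE_RATE_COUNTS_PER_PIXEL_PER_S * exposure_s
    # Assumption: the dose rate is quoted per physical pixel, which is
    # twice the super-resolution pixel size along each axis.
    physical_pixel_a = 2.0 * SUPER_RES_PIXEL_A
    # Assumption: one detector count is approximately one electron.
    dose_per_a2 = counts_per_pixel / physical_pixel_a ** 2
    return {
        "exposure_s": exposure_s,
        "counts_per_pixel": counts_per_pixel,
        "physical_pixel_A": physical_pixel_a,
        "approx_dose_per_A2": dose_per_a2,
    }
```

Under these assumptions each movie spans 10 s and accumulates 80 counts per pixel, or roughly 73 counts per square ångström.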


Cryo-EM data processing 

Movie frame alignment, estimation of the microscope contrast-transfer
function parameters, particle picking and extraction were carried out
using Warp. Particle images were extracted with a box size of 800
binned to 400, yielding a pixel size of 1.05 Å. For each dataset, two
rounds of reference-free 2D classification were performed using
cryoSPARC to select well-defined particle images. Subsequently, two
rounds of 3D classification with 50 iterations each (angular sampling
7.5° for 25 iterations and 1.8° with local search for 25 iterations), using
a previously reported closed SARS-CoV-2 S glycoprotein structure as
initial model, were carried out using Relion without imposing
symmetry to separate distinct SARS-CoV-2 S glycoprotein conformations.
Three-dimensional refinements were carried out using non-uniform
refinement along with per-particle defocus refinement in cryoSPARC.
Particle images were subjected to Bayesian polishing before
performing another round of non-uniform refinement in cryoSPARC, followed
by per-particle defocus refinement and again non-uniform
refinement. Reported resolutions are based on the gold-standard Fourier
shell correlation of 0.143 criterion and Fourier shell correlation curves
were corrected for the effects of soft masking by high-resolution noise
substitution.


Cryo-EM model building and analysis 

UCSF Chimera and Coot were used to fit atomic models (Protein Data
Bank codes (PDB) 6VXX and PDB 6VYB) into the cryo-EM maps. The Fab
was subsequently manually built using Coot. N-linked glycans were
hand-built into the density where visible, and the models were refined
and relaxed using Rosetta. Glycan refinement relied on a dedicated
Rosetta protocol, which uses physically realistic geometries based on
prior knowledge of saccharide chemical properties, and was aided by
using both sharpened and unsharpened maps. Models were analysed
using MolProbity, EMringer, Phenix and Privateer to validate the
stereochemistry of both the protein and glycan components. Figures
were generated using UCSF ChimeraX.


Crystallization and X-ray structure determination of Fab S309
Fab S309 crystals were grown in a hanging drop set up with a mosquito
at 20 °C using 150 nl protein solution in Tris HCl pH 8.0, 150 mM NaCl
and 150 nl mother liquor solution containing 1.1 M sodium malonate,
0.1 M HEPES, pH 7.0 and 0.5% (w/v) Jeffamine ED-2001. Crystals were
cryo-protected using the mother liquor solution supplemented with
30% glycerol. The dataset was collected at ALS beamline 5.0.2 and
processed to 3.3 Å resolution in space group P41212 using mosflm and
Aimless. The structure of Fab S309 was solved by molecular
replacement using Phaser and homology models as search models. The
coordinates were improved and completed using Coot and refined with
REFMAC5. Crystallographic data collection and refinement statistics
are shown in Extended Data Fig. 5g.


Conservation analysis 

SARS-CoV-2 genome sequences were downloaded from GISAID on
27 April 2020 (n = 11,839) using the ‘complete (>29,000 bp)’ and ‘low
coverage exclusion’ filters. Bat and pangolin sequences were removed to yield
human sequences only. The S glycoprotein ORF was localized by
aligning the reference protein sequence (YP_009724390.1) to the genomic
sequence of isolates with Exonerate v.2.4.0 (-m protein2dna --refine
full --minintron 999999 --percent 30 --showalignment false --showvulgar
false --ryo ">%ti\n%tcs"). Coding nucleotide sequences were translated in
silico using seqkit v.0.12.0. Multiple sequence alignment was performed
using MAFFT v.7.455 (--amino --bl 80 --nomemsave --reorder --add
spike_aa_sequences.fasta --keeplength reference_aa_sequence.fasta).
Variants were determined by comparison of aligned sequences to the 
reference sequence using the R v3.6.3/Bioconductor v.3.10 package 
Biostrings v.2.54.0 (function: consensusMatrix). A similar strategy was 
used to extract and translate S glycoprotein sequences from SARS-CoV 
genomes sourced from ViPR (search criteria: SARS-related coronavirus, 
full-length genomes, human host, deposited before December 2019 
to exclude SARS-CoV-2, n=53, performed on 29 March 2020). We con- 
firmed that sourced SARS-CoV genome sequences comprised all the 
major published strains (such as Urbani, Tor2, TW1, P2, and Frankfurt1, 
among others). Pangolin sequences were sourced from GISAID and bat
sequences from the three clades of sarbecoviruses were downloaded
from GenBank (civet (AY304486.1) and raccoon dog (AY304487.1)). Full
conservation analysis code is available at
https://github.com/virbio/manuscript-cov2-pinto-conservation.
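
The variant-calling step (comparing aligned sequences to the reference through a per-position count matrix) can be sketched in Python as follows. This is an illustrative analogue of the consensus-matrix idea, not the authors' R code, which is available at the repository above; the function names and the choice to ignore gap and ambiguity characters are assumptions.

```python
from collections import Counter


def consensus_matrix(aligned_seqs):
    """Per-column residue counts for equal-length aligned sequences."""
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("aligned sequences must have equal length")
    return [Counter(column) for column in zip(*aligned_seqs)]


def variants_vs_reference(reference, aligned_seqs, ignore="-X"):
    """Map 1-based position -> {residue: count} for non-reference residues.

    Gap ('-') and ambiguity ('X') characters are skipped by default;
    that choice is an assumption of this sketch.
    """
    variants = {}
    matrix = consensus_matrix(aligned_seqs)
    for position, (ref_residue, counts) in enumerate(
            zip(reference, matrix), start=1):
        alternatives = {
            residue: n for residue, n in counts.items()
            if residue != ref_residue and residue not in ignore
        }
        if alternatives:
            variants[position] = alternatives
    return variants
```

Because the alignment keeps the reference length, positions in the returned mapping correspond directly to reference residue numbers.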


Reporting summary 
Further information on research design is available in the Nature 
Research Reporting Summary linked to this paper.

Extended Data Fig. 2 | Antibody affinity and avidity of S309, S303, S304 and
S315 to the RBD of SARS-CoV and SARS-CoV-2. Antibodies were loaded to
biolayer interferometry (BLI) pins via protein A for the measurement of
association of different concentrations of the RBD of SARS-CoV-2 (blue) and
SARS-CoV (red). Vertical dashed lines indicate the start of the dissociation
phase, when BLI pins were switched to buffer. The experiments were done once
for S303, S304 and S315. The experiment for S309 was repeated once.

Extended Data Fig. 3 | Binding of crossreactive mAbs to expiCHO cells
transfected with SARS-CoV or SARS-CoV-2 S glycoprotein. Mean fluorescence
intensity as measured in flow cytometry for each antibody. Antibody
concentrations tested are indicated in the x axis.

Extended Data Fig. 4 | Neutralization of cross-reactive mAbs. a, b, The six
SARS-CoV-2-MLV (a) and SARS-CoV-MLV (b). Symbols are mean of duplicates.
correlation curves for the closed (blue) and partially open trimers (red). Fab X-ray structure.
of the full RBD of multiple sarbecoviruses. The ACE2 and S309 footprints are motif is shown in orange. c, The SARS-CoV and SARS-CoV-2 differences are
highlighted in blue and magenta, respectively. Dashed boxes indicate shown in green, and differences within the S309 footprint are shown in pink

Extended Data Fig. 8 | Competition of antibodies with RBD binding to
ACE2. a, Human ACE2 (hACE2) was loaded onto BLI sensors, followed by
incubation of the sensors with RBD alone or RBD in combination with
recombinant antibodies. The vertical dashed line indicates the start of the
loading of RBD with or without antibody. b, SARS-CoV-2 ectodomain was

Extended Data Fig. 9 | ADCC and ADCP data for one representative donor.
This figure is related to Fig. 3. a, ADCC for one donor who is homozygous for
high-affinity variant FcγRIIIa 158V (VV). Background signal of cells without
antibody was deducted from all values before plotting. b, ADCP for one donor
who is heterozygous for FcγRIIIa 158V (FV). The dashed line indicates the
background signal for cells without antibody. Symbols are mean of duplicates.
Cells from each donor were sufficient to conduct one experiment.

Extended Data Fig. 10 | Competition of mAb pairs for binding to the RBD
domain of SARS-CoV, as determined by BLI (Octet). RBD was loaded on BLI
pins. Association was measured first for antibodies indicated on the left of the
matrix, followed by association of the antibodies indicated on top of the
matrix. Each combination of antibodies was tested once.

The mammalian immune system implements a remarkably effective set of
mechanisms for fighting pathogens. Its main components are haematopoietic
immune cells, including myeloid cells that control innate immunity, and lymphoid
cells that constitute adaptive immunity. However, immune functions are not unique
to haematopoietic cells, and many other cell types display basic mechanisms of
pathogen defence. To advance our understanding of immunology outside the
haematopoietic system, here we systematically investigate the regulation of immune
genes in the three major types of structural cells: epithelium, endothelium and
fibroblasts. We characterize these cell types across twelve organs in mice, using
cellular phenotyping, transcriptome sequencing, chromatin accessibility profiling
and epigenome mapping. This comprehensive dataset revealed complex immune
gene activity and regulation in structural cells. The observed patterns were highly
organ-specific and seem to modulate the extensive interactions between structural
cells and haematopoietic immune cells. Moreover, we identified an epigenetically
encoded immune potential in structural cells under tissue homeostasis, which was
triggered in response to systemic viral infection. This study highlights the prevalence
and organ-specific complexity of immune gene activity in non-haematopoietic
structural cells, and it provides a high-resolution, multi-omics atlas of the epigenetic
and transcriptional networks that regulate structural cells in the mouse.


The structure of most tissues and organs in the mammalian body is
shaped by epithelial cells (epithelium), which create internal and
external surfaces and barriers; endothelial cells (endothelium), which form
the lining of blood vessels; and fibroblasts, which provide essential
connective tissue (stroma). These three cell types, which we refer to
as structural cells, have been shown to contribute in important ways
to mammalian immunity. However, there has been little systematic
investigation across organs, in part because structural cells are
difficult to study by genetic ablation owing to their essential
structural roles in most organs. Multi-omics profiling has emerged as a
promising approach to dissecting immune regulation in a systematic,
genome-wide manner, as illustrated by recent work on the systems
immunology of haematopoietic immune cells.

In this study, we used multi-omics profiling and integrative bioin- 
formatics to establish a high-resolution atlas of structural cells and of 
non-haematopoietic immune regulation in the mouse. We observed 
widespread expression of immune regulators and cytokine signalling 
molecules in structural cells, organ-specific adaptation to the tissue 
environment, and unexpectedly diverse capabilities for interacting 
with haematopoietic cells. These cell-type-specific and organ-specific 
differences in immune gene activity were reflected by characteristic
patterns of chromatin regulation. Notably, we found evidence
of an epigenetically encoded immune potential under homeostatic
conditions, and the affected genes were preferentially upregulated 
in response to an immunological challenge induced by systemic viral 
infection. We validated and functionally dissected this epigenetic 
potential of structural cells by further in vivo experiments with recom- 
binant cytokines. 

In summary, our study uncovered widespread immune gene reg- 
ulation in structural cells of the mouse, and it established a multi- 
organ atlas of the underlying epigenetic and transcription-regulatory 
programs. 


Mapping structural cells across organs 


To investigate the regulation of immune genes in structural cells,
we performed multi-omics profiling of endothelium, epithelium
and fibroblasts from 12 mouse organs (brain, caecum, heart, kidney,
large intestine, liver, lung, lymph node, skin, small intestine,
spleen and thymus). Single-cell suspensions were analysed by flow
cytometry, and sort-purified cell populations were profiled with
three genome-wide assays (Fig. 1a): (i) gene expression profiling by
low-input RNA sequencing (RNA-seq); (ii) chromatin accessibility
profiling with the assay for transposase-accessible chromatin using
sequencing (ATAC-seq); and (iii) epigenome profiling by
ChIPmentation with an antibody against the promoter and enhancer-linked
histone H3K4me2 mark. All assays produced high-quality data
(Supplementary Table 1). The full dataset is provided as an online resource
for interactive browsing and download at
http://structural-immunity.computational-epigenetics.org.

Fig. 1 | Multi-omics profiling establishes cell-type-specific and
organ-specific characteristics of structural cells. a, Schematic outline of the
experimental approach. b, Relative frequencies of structural cell types based
on flow cytometry. c, Expression of surface markers among structural cells,
comparing the standardized sorting of endothelium, epithelium and
fibroblasts to potential alternative markers (left: schematic outline; centre:
heat maps showing marker overlap; right: illustrative FACS plots). d, Expression
of differentially regulated genes across cell types and organs. Gene clusters are
annotated with enriched terms based on gene set analysis. e, Correlation of
gene expression across cell types and organs. Sample size: n = 4 (b) and n = 3
(c-e) independent biological replicates.

To maximize comparability across organs, we developed a
standardized workflow for tissue dissociation and cell purification (Fig. 1a).
Structural cells were purified with an organ-independent sorting
scheme that comprised the endothelium marker CD31 (encoded by
Pecam1), the epithelium marker EpCAM (Epcam) and the fibroblast
marker GP38 (encoded by Pdpn, also known as podoplanin) (Extended
Data Fig. 1a, b). All three types of structural cells were detectable in all
12 organs, with strong differences in their relative frequencies (Fig. 1b,
Extended Data Fig. 1c). Our standardized tissue dissociation did not
cause major technical biases (Extended Data Fig. 1d).

We validated and phenotypically characterized structural cell
populations by flow cytometry for additional markers (Fig. 1c, Extended
Data Fig. 2). Purification of CD31+GP38− endothelial cells specifically
enriched for blood endothelium while excluding the less prevalent
CD31+GP38+ cells of lymphatic endothelium, which facilitates the
comparison across organs. Alternative endothelial markers such as CD144
(VE-cadherin), MAdCAM1 and VCAM1 (used to assess baseline
activation of endothelium under homeostatic conditions) did not improve
the identification of blood endothelial cells in our analysis. Similarly,
sorting of fibroblasts as GP38+CD31− cells did not miss any major cell
populations identified by the alternative markers CD90.2 (Thy-1.2),
LTBR or PDGFRα. Finally, E-cadherin could not enhance or replace our
sorting of epithelium as EpCAM+ cells.

The cellular identity of the sorted endothelium, epithelium and 
fibroblasts was further confirmed by RNA-seq data analysis, which 
showed the expected organ-specific (Extended Data Fig. 3a) and 
cell-type-specific (Extended Data Fig. 3b) patterns of gene
expression compared to published multi-tissue expression profiles.
We also observed the expected expression patterns for various 
cell-type-specific marker genes, but with a high degree of transcrip- 
tional heterogeneity across organs (Extended Data Fig. 3c, Supple- 
mentary Table 2). Notably, the expression profiles of structural cells 
within the same organ were globally more similar to each other than 
structural cells of the same type across organs (Fig. 1d, e, Extended Data 
Fig. 3d), which suggests that the tissue and organ environment has a 
major effect on the transcriptomes of structural cells. Our multi-omics 
profiles for endothelium, epithelium and fibroblasts across 12 mouse 
organs thus uncover a marked degree of organ-specific differences 
among structural cells. 


Immune gene activity in structural cells 


On the basis of our RNA-seq dataset, we investigated the immune gene
activity in structural cells. First, we assessed the ability of structural
cells to communicate with haematopoietic immune cells, by inferring
a network of potential cell-cell interactions based on known
receptor-ligand pairs. The resulting cell-cell interaction network (Fig. 2a,
Supplementary Table 3) predicted frequent crosstalk between structural
cells and haematopoietic cells under homeostasis (Fig. 2a, Extended
Data Fig. 4a). Differences across cell types and organs were driven by the
characteristic expression patterns of cell-surface proteins and secreted
factors in structural cells (Fig. 2b). For example, strong expression of
collagens (Col4a3, Col4a4, Col5a1, Col6a1 and Col12a1) and of
complement component 3 (C3) in fibroblasts is expected to enhance their
ability to interact with haematopoietic cells; high levels of Muc2 and Arg2 in
the digestive tract foster interactions with macrophages and natural
killer cells; and Ccl25 expression in thymus epithelium contributes to
the maturation of T cells. Shared across all types of structural cells,
we observed high expression of the inflammatory mediators Apoe,
S100a8 and S100a9.

Fig. 2 | Gene expression of structural cells predicts cell-type-specific and
organ-specific crosstalk with haematopoietic immune cells. a, Network of
potential cell-cell interactions between structural cells and haematopoietic
immune cells inferred from gene expression of known receptor-ligand pairs.
NK cell, natural killer cell. b, Expression of receptors and ligands in structural
cells, annotated with the cell-cell interactions that they may mediate (genes
discussed in the text are in bold). RPKM, reads per kilobase of transcript per
million mapped reads. c, Gene signatures of receptors (R) and ligands (L) in
structural cells. Sample size (all panels): n = 3 independent biological
replicates.

Second, we quantified the aggregated activity of various immune
gene modules, which were manually curated to capture important
components of the immune system (Extended Data Fig. 4b, Supplementary
Table 4). We observed widespread activity of these immune gene
modules in structural cells, with highly cell-type-specific and organ-specific
patterns (Fig. 2c). Next, we sought to validate the unexpectedly strong
and diverse immune gene activity of structural cells in a second dataset.
To that end, we obtained single-cell transcriptome profiles of mouse
tissues from the Tabula Muris resource, and we identified
endothelium, epithelium and fibroblasts bioinformatically on the basis of the
expression of marker genes. Although structural cells were not well
covered in this dataset (Extended Data Fig. 5a, Supplementary Table
5), these profiles were sufficient to independently confirm strong and
organ-specific activity of immune genes in structural cells (Extended
Data Fig. 5b-d). Together, these data reveal widespread activity of
immune genes and regulatory modules in structural cells, which are
expected to mediate cell-type-specific and organ-specific interactions
with haematopoietic immune cells.
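
The receptor-ligand inference described in this section can be sketched as follows. This is a simplified illustration, not the authors' pipeline: it assumes that an interaction is called whenever one cell population expresses a ligand and the other its receptor above a cut-off. The function name, the expression cut-off and all expression values are hypothetical toy inputs; only the gene and cell-type names echo the text (Ccr9 is the known receptor for Ccl25).

```python
def infer_interactions(structural, immune, pairs, min_expr=1.0):
    """List (sender, receiver, ligand, receptor) edges between populations.

    structural / immune: {population: {gene: expression}}
    pairs: iterable of (ligand, receptor) gene names
    """
    def expressed(profile, gene):
        return profile.get(gene, 0.0) >= min_expr

    edges = []
    for s_name, s_profile in structural.items():
        for i_name, i_profile in immune.items():
            for ligand, receptor in pairs:
                # Structural cell sends, immune cell receives.
                if expressed(s_profile, ligand) and expressed(i_profile, receptor):
                    edges.append((s_name, i_name, ligand, receptor))
                # Immune cell sends, structural cell receives.
                if expressed(i_profile, ligand) and expressed(s_profile, receptor):
                    edges.append((i_name, s_name, ligand, receptor))
    return edges


# Toy inputs (hypothetical expression values, for illustration only).
PAIRS = [("Ccl25", "Ccr9")]
STRUCTURAL = {
    "thymus epithelium": {"Ccl25": 40.0},
    "liver fibroblasts": {"Ccl25": 0.1},
}
IMMUNE = {"T cell": {"Ccr9": 12.0}, "macrophage": {"Ccr9": 0.0}}
EDGES = infer_interactions(STRUCTURAL, IMMUNE, PAIRS)
```

With these toy values the only edge links thymus epithelium to T cells through Ccl25 and Ccr9.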


Regulatory networks in structural cells 


To uncover the regulatory basis of immune gene activity in structural
cells, we combined our RNA-seq data with ATAC-seq profiles of
chromatin accessibility and ChIPmentation maps for the promoter and
enhancer-linked H3K4me2 mark, and we compared gene-regulatory
networks across cell types and organs (Fig. 3a).

We observed extensive cell-type-specific and organ-specific
chromatin regulation at immune gene loci (Fig. 3b). For example, the
immune-regulatory transcription factors Stat5a and Stat5b showed
characteristic patterns of chromatin accessibility; Tlr9 and Osm
were characterized by organ-specific heterogeneity in promoter and
enhancer regions; and the promoter of the renal cell adhesion molecule
Cdh16 was exclusively open in kidney. By contrast, a subset of crucial
immune genes (exemplified by Ifngr1) showed high chromatin
accessibility in most samples, indicative of a shared core of immune regulation
in structural cells. Consistent with our RNA-seq analysis, the chromatin
profiles were globally more similar between different types of
structural cells in one organ than among the same cell type across organs
(Extended Data Fig. 6a, b).

Fig. 3 | Structural cells implement characteristic gene-regulatory
networks and an epigenetic potential for immune gene activation.
a, Schematic outline of the gene-regulatory network analysis. b, ATAC-seq
signal tracks (average across replicates) for selected genomic regions.
Differences between cell types and organs are highlighted by black boxes.
c, Motif enrichment for transcriptional regulators in cell-type-specific and
organ-specific chromatin marker peaks (one-sided hypergeometric test with
multiple-testing correction). OR, odds ratio; Padj, adjusted P value. d, Schematic
outline (left) and a concrete example (right) of the epigenetic potential, based
on the comparison of chromatin accessibility (ATAC-seq) in promoter regions
with matched gene expression (RNA-seq). e, Scatterplot showing the
correlation between promoter chromatin accessibility and gene expression
across all genes in liver endothelium and epithelium, with Ifngr2 highlighted by
the red dot. f, Immune genes with unrealized epigenetic potential across cell
types and organs. g, Pearson correlation between promoter chromatin
accessibility and gene expression across cell types and organs (mean and
s.e.m. across pairwise correlations; red bars indicate the maximum scope for
unrealized epigenetic potential). Sample size: ATAC-seq n = 2 (b-g), RNA-seq
n = 3 (c-g) independent biological replicates.

We inferred a gene-regulatory network of structural cells by
connecting transcription factors to their target genes, based on predicted
binding sites with open chromatin in the respective cells (Extended
Data Fig. 6c). Many key regulators of transcription in structural cells
showed cell-type-specific and organ-specific activity (Fig. 3c, Extended
Data Fig. 6d, Supplementary Table 6). For example, ATF, ELK, ETS and
JUND were most active in lung endothelium; KLF and CDX in
digestive tract epithelium; and HNF in kidney fibroblasts and in epithelium
of caecum, large intestine and small intestine. We also identified
groups of transcription factors that were ubiquitously active in
structural cells, such as ELF1, ELF3, ETS1, FLI1 and GATA2 (Extended Data
Fig. 6e), which may constitute a shared regulatory basis of immune
gene activity in structural cells. Our inferred gene-regulatory networks
support a model in which constitutively active regulators establish a
shared core of immune functions, while additional factors contribute
cell-type-specific and organ-specific adaptations.
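
The one-sided hypergeometric test named in the Fig. 3c legend for motif enrichment can be sketched with the standard library alone. This is a generic illustration of the statistic, not the authors' code: the function names and the way peaks are counted are assumptions, and the multiple-testing correction mentioned in the legend is not implemented here.

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """One-sided P(X >= k) for a hypergeometric draw.

    N: all peaks considered (background)
    K: background peaks that contain the motif
    n: marker peaks of one cell type or organ
    k: marker peaks that contain the motif
    """
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1))
    return tail / total


def odds_ratio(k, n, K, N):
    """Odds ratio of the corresponding 2x2 table (motif x marker peak)."""
    with_motif_in, without_motif_in = k, n - k
    with_motif_out, without_motif_out = K - k, N - K - n + k
    return (with_motif_in * without_motif_out) / (without_motif_in * with_motif_out)
```

For example, `hypergeom_enrichment_p(5, 5, 5, 10)` is 1/252, the probability that all five marker peaks carry a motif present in half of ten background peaks.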


Epigenetic potential for gene activation 


Epigenome profiles not only capture the current regulatory states 
of cells, but also reflect their future potential to respond to various 
stimuli”. We thus hypothesized that structural cells may be epigeneti- 
cally primed for immune gene activation, which would pre-program 
them fora rapid response to immunological challenges. 

To assess the epigenetically encoded immune potential of structural 
cells, we quantified the chromatin accessibility of each gene promoter and 
compared it to the expression level of the corresponding gene. We then 
scanned for genes with low expression but high promoter accessibility, 
indicative of an unrealized potential for increased expression (Fig. 3d,e, 
Extended Data Fig. 7a). For example, chromatin accessibility at the Ifngr2
promoter was high in liver endothelium whereas gene expression was
low, which suggests that Ifngr2 has unrealized potential for upregulation
without the need to increase promoter accessibility. By contrast, accessibility
of the Ifngr2 promoter in liver epithelium was highly consistent
with its gene expression, thus constituting a case of realized potential.
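
The scan for genes with low expression but high promoter accessibility can be illustrated with a minimal sketch. It assumes that both measurements have already been normalized to a comparable scale; the `Gene` record, the `min_gap` cut-off and all numeric values are hypothetical and chosen only for illustration, and they are not the thresholds or measurements used in the study.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    name: str
    accessibility: float  # normalized promoter accessibility (assumed scale)
    expression: float     # normalized gene expression (assumed same scale)


def unrealized_potential(genes, min_gap=1.0):
    """Return names of genes whose promoter accessibility exceeds
    their expression by at least min_gap (hypothetical cut-off)."""
    return [g.name for g in genes if g.accessibility - g.expression >= min_gap]


# Toy values, not measurements from the dataset.
genes = [
    Gene("gene_open_but_silent", accessibility=2.0, expression=0.2),
    Gene("gene_open_and_expressed", accessibility=1.1, expression=1.0),
    Gene("gene_closed_and_silent", accessibility=0.1, expression=0.1),
]

poised = unrealized_potential(genes)
print(poised)
```

In this toy example only the first gene is flagged, which mirrors the contrast between unrealized and realized potential described above.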

Fig. 4 | Systemic viral infection activates the immunological potential of
structural cells in vivo. a, Schematic outline of the LCMV infection analysis.
b, Comparison of the changes in gene expression after LCMV infection (day 8)
to the epigenetic potential observed under homeostatic conditions (day 0),
using a threshold of zero for differential gene expression (left) or a variable
threshold analogous to a receiver operating characteristic curve (centre).


The genes with unrealized epigenetic potential in structural cells 
were enriched for immune functions (Fig. 3f, Extended Data Fig. 7b, 
Supplementary Tables 7 and 8), consistent with our initial hypothesis. In 
total, we identified 1,665 genes that fulfilled our definition of unrealized 
epigenetic potential, of which 335 genes were annotated with at least 
one immunological term (odds ratio 1.37, P< 10°, Fisher's exact test). 

To quantify and compare the epigenetic potential across cell types 
and organs (Fig. 3g), we exploited the fact that genes with strong unrealized
potential are outliers when plotting gene expression against promoter 
accessibility, which results in a reduced correlation between the two 
data types (Extended Data Fig. 7a). The highest correlation between 
gene expression and promoter accessibility was observed in brain, cae- 
cum, heart, kidney, large intestine and skin, leaving comparatively less 
room for unrealized epigenetic potential. By contrast, the correlation
was notably lower in liver, lymph node, spleen and thymus, which suggests
that structural cells in these organs harbour a more pronounced
epigenetic potential for gene activation in response to various stimuli.
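
The reasoning that poised genes act as outliers and lower the correlation between the two data types can be checked on a toy example. The sketch below computes the Pearson correlation by hand; the accessibility and expression vectors are invented for illustration and do not come from the dataset.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Toy promoter accessibility for five genes.
accessibility = [0.0, 1.0, 2.0, 3.0, 4.0]
# Expression tracks accessibility closely (little unrealized potential).
expression_concordant = [0.1, 1.0, 2.1, 2.9, 4.0]
# The two most accessible genes are lowly expressed (poised outliers).
expression_with_poised = [0.1, 1.0, 2.1, 0.2, 0.3]

r_concordant = pearson(accessibility, expression_concordant)
r_poised = pearson(accessibility, expression_with_poised)
print(round(r_concordant, 3), round(r_poised, 3))
```

The second correlation is markedly lower than the first, which is the property used here to compare epigenetic potential across cell types and organs.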

Our integrative analysis of chromatin accessibility and gene expres- 
sion thus identified an epigenetically encoded potential for immune 
gene activation in structural cells. This epigenetic potential is expected 
to facilitate the rapid response to immunological challenges in a 
cell-type-specific and organ-specific manner. 


Immune genes induced by viral infection 

We functionally evaluated the epigenetic potential for immune gene 
activation by challenging mice with a systemic viral infection model. 
We infected mice with lymphocytic choriomeningitis virus (LCMV) and 
collected samples from 12 organs on day 8 after infection (Fig. 4a). We 
characterized these samples by flow cytometry (Extended Data Fig. 8a, b)
and by RNA-seq analysis of sort-purified structural cells (Supplemen- 
tary Table 1). 

LCMV infection resulted in changes of structural cell composition in 
most organs (Extended Data Fig. 8c), and we observed differential gene 
expression in a cell-type-specific and organ-specific manner (Extended
Data Fig. 9a, Supplementary Table 9). The transcriptional response was 
globally more similar among structural cells of all three types within the 
same organ than between structural cells of a given type across organs 
(Extended Data Fig. 9b), consistent with the patterns of similarity that 
we observed under homeostatic conditions. 

The area under the curve is interpreted as a measure of the predictiveness of
the epigenetic potential for LCMV-induced gene activation, plotted together
with the percentage of upregulated genes that carry unrealized epigenetic
potential (right). c, Enrichment of immune-related gene sets among the
LCMV-induced genes (two-sided Fisher’s exact test with multiple-testing
correction). Sample size (all panels): n = 3 independent biological replicates.


We compared the differential gene expression after LCMV infection 
with the corresponding epigenetic potential of the genes under homeo- 
static conditions, and genes with unrealized potential were indeed 
overrepresented among the LCMV-induced genes (Fig. 4b, left). To 
quantify how well the epigenetic potential predicts LCMV-induced gene 
activation, we plotted the percentage of genes with realized potential 
against the percentage of genes that underwent de novo activation 
(Fig. 4b, centre), in analogy with receiver operating characteristic (ROC) 
curves. We detected the strongest association in liver, lung and spleen 
(Fig. 4b, right). These organ-specific differences showed no clear cor- 
relation with differences in viral load (Extended Data Fig. 9c) and appear 
to constitute intrinsic regulatory differences between cell types and 
organs (Extended Data Fig. 9d). 
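
The idea of scoring predictiveness with a variable threshold can be sketched as a standard ROC area under the curve. This is a generic formulation and not necessarily the exact procedure used for Fig. 4b: each gene gets a hypothetical potential score (for example, accessibility minus expression at day 0) and a label stating whether it was induced after infection. All values are toy values.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison:
    the fraction of (induced, non-induced) gene pairs in which the
    induced gene has the higher score; ties count as one half."""
    positives = [s for s, is_pos in zip(scores, labels) if is_pos]
    negatives = [s for s, is_pos in zip(scores, labels) if not is_pos]
    if not positives or not negatives:
        raise ValueError("need at least one induced and one non-induced gene")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data: epigenetic potential at day 0 and induction status at day 8.
potential = [1.8, 1.2, 0.9, 0.3, 0.1, 0.0]
induced = [True, True, False, True, False, False]

auc = roc_auc(potential, induced)
print(round(auc, 3))
```

An area of 0.5 would correspond to no predictive value, whereas values approaching 1 indicate that genes with high potential are preferentially induced.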

The genes that were upregulated in response to systemic LCMV infection
showed strong enrichment for immune functions (Fig. 4c), including
‘positive regulation of immune response’, ‘defence response to
virus’, ‘cellular response to IFNγ’ and ‘antigen processing and presentation’.
We also observed widespread upregulation of interferon-induced
as well as interferon-stimulated genes and of key transcriptional regu- 
lators in the interferon pathway, which indicates a strong interferon 
response to LCMV infection in structural cells (Supplementary Table 9). 

Finally, we inferred receptor-ligand interactions between struc- 
tural cells and haematopoietic immune cells, using the transcriptome 
data of structural cells upon LCMV infection (Extended Data Fig. 9e, f, 
Supplementary Table 10). We observed an increase in the strength 
and scope of predicted cell-cell interactions after LCMV infection as 
compared to homeostatic conditions, largely driven by upregulated 
gene expression levels for receptors and ligands in structural cells, 
including B2m, Cd74, Cd47, Cxcl10, Sdc1, Sdc4, Tnfrsf1a and Vcam1.

Systemic LCMV infection thus triggered widespread activation of 
immune genes that were lowly expressed but epigenetically poised 
under homeostatic conditions. These results support our model of 
an epigenetically encoded potential for immune gene activation in 
structural cells in the context of viral infection. 


Cytokine response of structural cells 
To characterize the effects of individual cytokines that may contribute
to the response to LCMV infection, we administered six recombinant
cytokines in mice (IFNα, IFNγ, IL-3, IL-6, TGFβ and TNF). These cytokines

Fig. 5 | Cytokine treatment induces cell-type-specific and organ-specific
changes in structural cells in vivo. a, Schematic outline of the cytokine
treatment experiments (left) and number of genes upregulated in each
experiment (right). b, Genes upregulated in response to treatment with IFNα.
c, d, Cytokine-induced changes in spleen endothelium (c) and liver fibroblasts
(d), showing the percentage of LCMV-induced changes that are recapitulated
by cytokine treatment (top left), enrichment for genes with unrealized


were selected based on our dataset and published results, and IL-3
was included as a control with no known role in LCMV infection. To 
focus on the immediate effects of cytokine signalling, structural cells 
were sort-purified two hours after cytokine injection and subjected to 
RNA-seq profiling (Fig. 5a, Supplementary Table 1). Four organs that 
responded strongly to LCMV infection were included in the analysis 
(large intestine, liver, lung and spleen). 

We compared the structural cell transcriptomes between mice that 
were cytokine-treated and corresponding mock-treated controls, which 
uncovered various cytokine-induced transcriptional changes. Treatment 
with IFNα had the strongest effect, inducing known interferon target
genes and several genes associated with antiviral immunity (Fig. 5b, 
Extended Data Fig. 10, Supplementary Table 11). We further observed 
widespread cell-type-specific and organ-specific differences (Extended 
Data Fig. 11a), which were not solely due to differential expression of 
individual cell surface receptors (Extended Data Fig. 11b), but appear 
to reflect more general differences in immune gene regulation. 

In spleen endothelium (highlighted here because of its strong transcriptional
response to LCMV infection), treatment with IFNγ and IL-6
explained a sizable proportion of the LCMV-induced changes (Fig. 5c,
top left). The cytokine-induced genes included known interferon target
genes, transcriptional regulators such as Nmi and Inpp5d (which
encodes SHIP1), and the pro-inflammatory gene Piezo1 (Fig. 5c, right,
Supplementary Table 11). For five of the six cytokines (not including 
IL-3), the upregulated genes showed significant overlap with the genes 
that carried unrealized epigenetic potential under homeostatic con- 
ditions (Fig. 5c, bottom left). These cytokines thus triggered similar 
aspects of the epigenetic potential in spleen endothelium as observed 
for systemic LCMV infection. 
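
The proportion of LCMV-induced changes explained by a cytokine can be read as a simple set overlap. The sketch below shows one plausible way to compute such a percentage; the function name and the gene identifiers are hypothetical placeholders, and the calculation is an assumption about how the overlap might be summarized rather than the published analysis code.

```python
def percent_recapitulated(infection_induced, cytokine_induced):
    """Percentage of infection-induced genes that are also induced
    by a given cytokine treatment."""
    if not infection_induced:
        raise ValueError("infection-induced gene set is empty")
    shared = infection_induced & cytokine_induced
    return 100.0 * len(shared) / len(infection_induced)


# Placeholder gene identifiers, not genes from the study.
lcmv_up = {"gene_a", "gene_b", "gene_c", "gene_d"}
cytokine_up = {"gene_a", "gene_b", "gene_x"}

pct = percent_recapitulated(lcmv_up, cytokine_up)
print(pct)
```

Genes induced only by the cytokine (here `gene_x`) do not enter the percentage, because the denominator is the infection-induced set.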

The results were qualitatively different in liver fibroblasts. We 
observed little overlap between the transcriptional response to the 
cytokines and to LCMV infection (Fig. 5d, top left), while there was 
still a strong association with the epigenetic potential (Fig. 5d, bottom 
left). Cytokine administration induced genes involved in metabolic 
processes (Cmpk2, Pofut2) as well as regulators of vesicle or membrane 
trafficking and antigen presentation (Bloc1s6, H2-Eb1, Rabac1 and
Sar1a) (Fig. 5d, right, Supplementary Table 11). Thus, the administra- 
tion of cytokines triggered different aspects of the epigenetic potential 
in liver fibroblasts compared with LCMV infection. 

In summary, structural cells responded in cell-type-specific and 
organ-specific ways to in vivo stimulation with individual cytokines, 

epigenetic potential among the cytokine-induced genes (bottom left), and
genes upregulated upon cytokine treatment (genes discussed in the text are in
bold). Significant enrichments (two-sided Fisher’s exact test, adjusted P < 0.05)
are labelled with an asterisk. Differential expression is based on a linear model
(two-sided test) with multiple-testing correction (b–d). Sample size (all
panels): n = 3 independent biological replicates.


which allowed us to functionally dissect the effects of systemic LCMV 
infection and provides further validation of the observed epigenetic 
potential for immune gene activity in structural cells. 


Discussion 


Structural cells, including endothelium, epithelium and fibroblasts, are 
important yet underappreciated contributors to mammalian immune 
responses. Here, we systematically investigated immune gene regula- 
tion in these non-haematopoietic cell types. To that end, we applied 
multi-omics profiling and integrative bioinformatic analysis to three 
types of structural cells purified from twelve different organs of the 
mouse. We observed unexpectedly strong and densely regulated 
expression of immune genes, both under homeostatic conditions and 
in response to immunological challenges (systemic viral infection with 
LCMV, in vivo cytokine treatment). 

Immunologists tend to consider structural cells mainly for their barrier
function (epithelium, endothelium) and their role as connective
tissue (fibroblasts), although important research has identified much
more diverse roles of structural cells in mammalian immunity. However,
comparative multi-organ investigations of structural cells have
been lacking. We sought to close this gap in our understanding of mammalian
immunity by evaluating the immune regulation of structural
cells in a systematic, genome-wide and organism-scale way. We identified
three main lines of evidence that highlight the immune-regulatory
potential of structural cells.

First, we observed extensive cell-type-specific and organ-specific 
regulation of genes that influence the capabilities of structural cells to 
engage in predicted interactions with haematopoietic immune cells. 
On the basis of our transcriptome data, we inferred an initial network 
of potential cell-cell interactions between structural cells and hae- 
matopoietic cells. It will be an important future goal to dissect these 
cell-cell interactions and to untangle the precise chain of command 
in the immunological communication of structural cells and haema- 
topoietic cells. 

Second, our genome-wide analysis of chromatin accessibility in 
structural cells uncovered not only a regulatory basis of their immune 
functions, but also an initial assessment of the transcriptional regula- 
tors that confer cellular identity to endothelial cells, epithelial cells 
and fibroblasts across 12 organs. Future studies could pursue genetic 
manipulation of these candidate regulators in a cell-type-specific and
organ-specific way, investigating the effect on immune-regulatory
functions, epigenetic landscapes and cellular identity. 

Third, our integrative analysis of gene expression and chromatin 
accessibility identified an epigenetically encoded immune potential in 
structural cells, constituted by genes that were lowly expressed under 
homeostatic conditions but epigenetically poised for much higher 
expression. These genes were enriched for immune functions, and 
they were preferentially upregulated in response to LCMV infection 
and cytokine administration in vivo. We thus conclude that structural 
cells are epigenetically pre-programmed for a swift response to a variety
of immunological challenges. It will be interesting to explore how the
epigenetic potential of structural cells responds to other stimuli and
whether it can be modulated for therapeutic purposes, for example in
the context of autoimmune diseases or the tumour microenvironment. 

In conclusion, our study provides a comprehensive characteriza- 
tion of immune gene regulation in structural cells, and an initial step 
towards the systematic, organism-scale dissection of immune func- 
tions beyond haematopoietic cells. To emphasize the importance of 
structural cells for mammalian immunity, we tentatively propose the 
term ‘structural immunity’ for the study of immune functions in the 
non-haematopoietic, structural cell populations of the body. We see 
our study and large-scale dataset as a starting point, reference atlas, 
and a collection of hypotheses for systematic as well as mechanistic
explorations in this emerging area of research. 

Methods


Mice 

C57BL/6J mice were bred and maintained under specific pathogen-free
conditions at the Institute of Molecular Biotechnology (IMBA) of
the Austrian Academy of Sciences in Vienna (Austria). In vivo experi- 
ments (systemic viral infection, cytokine treatments) were performed 
under specific pathogen-free conditions at the Anna Spiegel Research 
Building of the Medical University of Vienna (Austria). Age-matched 
male mice (8 to 13 weeks old) were used in all experiments. For the 
characterization of structural cells under homeostatic conditions, 
mice were killed without any previous treatment. For the systemic viral 
infection experiments, mice were intravenously infected with 2 x 10° 
focus-forming units of LCMV strain clone 13 and killed on day 8
after infection. For the cytokine treatment experiments, mice were
intravenously injected with 100 µg kg⁻¹ of the following recombinant
cytokines: IFNα, IFNγ, IL-3, IL-6, TGFβ or TNF (all from BioLegend) and
killed 2 h after cytokine injection. All mouse experiments were per- 
formed in individually ventilated cages according to the respective 
animal experiment licenses (BMWFW-66.009/0199-WF/V/3v/2015 and 
BMWFW-66.009/0361-WF/V/3b/2017) approved by the institutional 
ethical committees and the institutional guidelines at the Department 
for Biomedical Research of the Medical University of Vienna. Sample
numbers are listed in the figure legends. No statistical methods were
used to predetermine sample size. The experiments were not rand- 
omized, and investigators were not blinded to treatment status during 
experiments and outcome assessment. 


Standardized sample collection and organ dissociation 

Different surface markers and sorting schemes were previously used 
to purify endothelium, epithelium and fibroblasts in individual organs, 
whereas our study required standardized cell purification across 
organs. We therefore tested several surface markers across organs 
and optimized a sorting scheme that produced consistent results for 
all 12 investigated organs, while excluding cell types that were detectable
only in one or a few tissues (most notably lymphatic endothelial
cells). The experimental workflow followed the recommendations of
the Immunological Genome Project regarding sample collection
schedule, antibody staining and sample pooling. At least three same-sex 
littermate mice were pooled for each biological replicate for homeo- 
static conditions and for LCMV infection, respectively. For the in vivo 
cytokine treatments, we used individual mice as biological replicates 
to reduce the total number of mice. Standardized organ harvesting and 
dissociation protocols were established to obtain single-cell suspensions
for subsequent cell purification by fluorescence-activated cell sorting
(FACS). The same digestion solution was used for all organs, to avoid 
organ-specific technical confounders. The workflow is described in 
detail below. 


Brain. After decapitation, the skull was cut longitudinally with scissors, 
and the cranium was opened with tweezers. Both brain hemispheres 
were carefully collected and placed into cell culture dishes containing 
cold PBS supplemented with 0.1% BSA (PBS + BSA). White matter was 
manually removed, the tissue shredded with scissors and added to a
50 ml tube containing 15 ml cold Accumax (Sigma-Aldrich) digestion 
solution and incubated for 45 min at 37 °C while shaking at 200 rpm. 
Remaining tissue fragments were processed with a Dounce homog- 
enizer (Sigma-Aldrich) followed by centrifugation at 300g for 5 min 
at 4°C. Myelin was removed by using density gradient centrifugation. 
Cells were recovered at the interface between an 80% Percoll layer and 
a 30% Percoll layer and washed in PBS + BSA to remove excess Percoll.


Caecum, large intestine and small intestine. Luminal contents were 
removed. The organs were cut longitudinally with scissors, rinsed several
times in PBS + BSA to remove mucus, then cut into 0.5 cm pieces
and placed in 50 ml tubes containing 20 ml pre-warmed (37 °C) RPMI
containing 10% FCS and 5 mM EDTA (RPMI + FCS + EDTA). Samples 
were incubated for 25 min at 37 °C while shaking at 200 rpm. Super- 
natant was collected, samples were resuspended once again in 20 ml 
pre-warmed RPMI + FCS + EDTA, and incubated for 25 min at 37 °C in a
shaking incubator at 200 rpm. These wash steps were performed to dissociate
epithelial cells. After the second incubation, supernatants were
collected and combined, followed by digestion of the samples in 15 ml
cold Accumax for 45 min at 37 °C while shaking at 200 rpm. Remaining
tissue fragments were processed with a Dounce homogenizer. Organ
homogenates were combined with the epithelial cell fractions, filtered
twice through a 100-µm cell strainer and washed in cold PBS + BSA.


Heart, kidney, lung, spleen and thymus. Organs were rinsed with cold 
PBS + BSA and shredded with scissors. Tissue fragments were placed 
into 50 ml tubes containing 20 ml cold Accumax digestion solution 
and incubated for 45 min at 37 °C while shaking at 200 rpm. Remain- 
ing tissue fragments were processed with a Dounce homogenizer and 
filtered through a 100-µm cell strainer. After centrifugation, 2 ml cold
ACK lysis buffer (Thermo Fisher Scientific) was added for 3 min to lyse
red blood cells and the reaction stopped by adding 20 ml of cold PBS +
BSA. Supernatants were filtered twice through a 100-µm cell strainer
and washed once to remove residual ACK lysis buffer. 


Liver. Three lobes were removed, rinsed with cold PBS + BSA, and 
shredded with scissors. Tissue fragments were placed into a 50 ml tube
containing 20 ml cold Accumax digestion solution and incubated for
45 min at 37 °C while shaking at 200 rpm. Remaining tissue fragments
were processed with a Dounce homogenizer and filtered through a
100-µm cell strainer. Hepatocytes were removed using density gradient
centrifugation. Cells recovered at the interface between an 80%
Percoll layer and a 30% Percoll layer were washed in PBS + BSA to remove
excess Percoll. 


Lymph nodes. Cervical, axillary and inguinal lymph nodes were com- 
bined, carefully pinched with tweezers, and rinsed several times with 
cold PBS + BSA to release haematopoietic cells. Tissue fragments were 
placed into a 50 ml tube containing 10 ml cold Accumax digestion solution
and incubated for 45 min at 37 °C while shaking at 200 rpm. Remaining
tissue fragments were processed with a Dounce homogenizer and
filtered twice through a 100-µm cell strainer.


Skin. Ears were removed at the base, and the subcutaneous fat layer was
scraped off with a scalpel. Tissue fragments were then shredded with
scissors, placed into 50 ml tubes containing 15 ml of Accumax digestion
solution, and incubated for 45 min at 37 °C while shaking at 200 rpm.
Remaining tissue fragments were processed with a Dounce homogenizer
and filtered twice through a 100-µm cell strainer.


Organ-specific sample collection and organ dissociation 

Large intestine. The organ was removed and processed as described.
In brief, luminal contents were removed, the large intestine cut longitudinally
with scissors, rinsed several times in PBS + BSA to remove mucus,
then cut into 0.5 cm pieces and placed in 50 ml tubes containing 20 ml 
pre-warmed (37 °C) RPMI containing 10% FCS and 5 mM EDTA (RPMI + 
FCS + EDTA). Samples were incubated for 40 min at 37 °C while shaking 
at 200 rpm. Supernatant was collected, samples were resuspended 
once again in 20 ml pre-warmed RPMI + FCS + EDTA and incubated for 
20 min at 37 °C while shaking at 200 rpm. These wash steps were per- 
formed to dissociate epithelial cells. After the second incubation, 
supernatants were collected and combined, followed by incubating 
samples in RPMI containing 10% FCS and 15 mM HEPES (RPMI + FCS + 
HEPES) for 10 min at room temperature. Supernatant was discarded 
and samples digested in RPMI + FCS + HEPES containing 100 U ml⁻¹
collagenase from Clostridium histolyticum (Sigma) for 1 h at 37 °C while
shaking at 200 rpm. Remaining tissue fragments were processed with
a Dounce homogenizer. Organ homogenates were combined with the
epithelial cell fractions, filtered twice through 100-µm cell strainers
and washed in cold PBS + BSA. 


Lung. The organ was removed and processed as described. In brief,
the organ was cut into 0.5-cm pieces and placed in gentleMACS C tubes
(Miltenyi) containing 160 U ml⁻¹ collagenase type 1 (Gibco) and 12 U ml⁻¹
DNase 1 (Sigma) in RPMI containing 5% FCS, and dissociated using a
gentleMACS Dissociator (Miltenyi; program m_lung_01). After incubation
at 37 °C for 35 min while shaking at 170 rpm, digested samples were
homogenized using the gentleMACS Dissociator (program m_lung_02).
Subsequently, cell suspensions were filtered through 70-µm cell strainers
and centrifuged for 5 min at 4 °C at 300g. After centrifugation,
1 ml cold ACK lysis buffer (Thermo Fisher) was added for 5 min to lyse
red blood cells and the reaction stopped by adding 20 ml of cold PBS
+ BSA. Supernatants were filtered twice through 100-µm cell strainers
and washed once to remove residual ACK lysis buffer. 


Liver. Mice were anaesthetized (ketamine:xylazine 1:3, 0.1 ml per 10 g
mouse; Vetoquinol) before cannulation of the liver and dissociation
using a two-step perfusion protocol. In brief, the liver was perfused
first with 20 ml HBSS (Gibco) containing 0.5 mM EGTA (Sigma) and then
with 20 ml L15 medium (Gibco) containing 40 mg“ liberase (Roche) at a
rate of 5 ml min⁻¹. Next, the liver was removed, placed in a Petri dish with
10 ml of the same liberase-containing medium, followed by removal of 
the liver capsule. Hepatocytes were removed from the resulting cell 
suspension using density gradient centrifugation. Cells recovered at 
the interface between an 80% Percoll layer and a 30% Percoll layer were
washed in PBS + BSA to remove excess Percoll. 


Flow cytometry and FACS 

Single-cell suspensions were washed once with PBS containing 0.1% 
BSA and 5 mM EDTA (PBS + BSA + EDTA). Cells were then incubated with
anti-CD16/CD32 (clone 93, BioLegend) to prevent nonspecific binding.
Single-cell suspensions were then stained with different combinations
of antibodies against CD45 (PerCP-Cy5.5, clone 30-F11), TER-119
(PerCP-Cy5.5, clone TER-119), GP38 (PE, clone 8.1.1), EpCAM (PE-Cy7,
clone 8.8), LTβR (APC, clone 5G11), CD31 (FITC, clone MEC13.3), CD90.2
(AF700, clone 30-H12), CD106 (AF647, clone 429), CD144 (BV421, clone
BV13), CD324 (APC-Cy7, clone DECMA-1) (all from BioLegend), CD140a
(BV605, clone APA5, BD Bioscience), MAdCAM1 (BV421, clone MECA-367,
BD Bioscience) and LYVE1 (eFluor 660, clone ALY7, Thermo Fisher
Scientific) for 30 min at 4 °C, followed by two washes with PBS + BSA
+ EDTA. Dead cells were stained by adding either Zombie Red Fixable 
Viability Dye or Zombie Aqua Fixable Viability Dye (both from Bio- 
Legend) immediately before flow cytometry characterization or cell 
sorting. For flow cytometry, cells were acquired with an LSRFortessa 
(BD Biosciences) cell analyser using the outlined gating strategies 
(Extended Data Figs. 1a, 2a, 8a). For FACS, cells were sort-purified with 
a MoFlo Astrios (Beckman Coulter) or SH800 (Sony) using the outlined
gating strategies (Extended Data Figs. 1a, 8a). Data analysis was per- 
formed using the FlowJo software (v.10.5.3, Tree Star). 


Transcriptome profiling by Smart-seq2 

Smart-seq2 was performed as previously described, starting from
low-input bulk samples. In each experiment, a maximum of 200 cells
were sort-purified and deposited in 96-well plates containing 4 µl lysis
buffer (1:20 solution of RNase Inhibitor (Clontech) in 0.2% (v/v) Triton
X-100 (Sigma-Aldrich)), spun down and immediately frozen at −80 °C.
Reverse transcription was performed using SuperScript II (Invitrogen)
followed by PCR amplification with KAPA HiFi HotStart Ready Mix (Kapa
Biosystems). cDNA amplification was followed by two rounds of SPRI
(Beckman Coulter) purification, and cDNA concentration was measured 
with a Qubit fluorometer (Life Technologies). Library preparation was 


conducted on 1 ng of cDNA using the Nextera XT DNA Sample Preparation
Kit (Illumina), followed by SPRI (Beckman Coulter) size selection.
Libraries were sequenced by the Biomedical Sequencing Facility at 
CeMM using the Illumina HiSeq 3000/4000 platform and the 50-bp 
single-end configuration. Transcriptome profiling by Smart-seq2 was 
done in three biologically independent experiments. Sequencing 
statistics are provided in Supplementary Table 1. 


Chromatin accessibility mapping by ATAC-seq 

ATAC-seq was performed as previously described, with minor adaptations.
In each experiment, a maximum of 50,000 sort-purified cells
were collected at 300g for 5 min at 4 °C. After centrifugation, the pellet
was carefully resuspended in the transposase reaction mix (12.5 µl 2× TD
buffer, 2 µl TDE1 (Illumina), 10.25 µl nuclease-free water and 0.25 µl 1%
digitonin (Promega)) for 30 min at 37 °C. Following DNA purification
with the MinElute kit eluting in 11 µl, 1 µl of eluted DNA was used in a
quantitative PCR (qPCR) reaction to estimate the optimum number of
amplification cycles. The remaining 10 µl of each library were amplified
for the number of cycles corresponding to the Cq value (that is, the
cycle number at which fluorescence has increased above background
levels) from the qPCR. Library amplification was followed by SPRI (Beckman
Coulter) size selection to exclude fragments larger than 1,200 bp.
DNA concentration was measured with a Qubit fluorometer (Life Technologies).
Library amplification was performed using custom Nextera
primers. Libraries were sequenced by the Biomedical Sequencing
Facility at CeMM using the Illumina HiSeq 3000/4000 platform and 
the 50-bp single-end configuration. Chromatin accessibility mapping 
by ATAC-seq was done in two biologically independent experiments. 
Sequencing statistics are provided in Supplementary Table 1. 
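
As a quick consistency check on the transposase reaction mix, the component volumes can be summed and the final concentrations derived. The sketch below assumes the volumes as read from the text above (12.5 µl of 2× TD buffer, 2 µl TDE1, 10.25 µl water and 0.25 µl of 1% digitonin); the variable names are illustrative only.

```python
# Component volumes of the ATAC-seq transposase reaction mix, in microlitres.
components_ul = {
    "2x TD buffer": 12.5,
    "TDE1 transposase": 2.0,
    "nuclease-free water": 10.25,
    "1% digitonin": 0.25,
}

total_ul = sum(components_ul.values())

# Final concentration = stock concentration * (component volume / total volume).
td_buffer_final_x = 2.0 * components_ul["2x TD buffer"] / total_ul
digitonin_final_percent = 1.0 * components_ul["1% digitonin"] / total_ul

print(total_ul, td_buffer_final_x, digitonin_final_percent)
```

Under these assumptions the mix totals 25 µl, which dilutes the 2× TD buffer to 1× and the digitonin to 0.01%.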


Epigenome mapping by ChIPmentation for H3K4me2 
ChIPmentation was performed as previously described, with minor
adaptations. In each experiment, a maximum of 50,000 sort-purified cells
were washed once with PBS and fixed with 1% paraformaldehyde for
10 min at room temperature. Glycine (0.125 M final concentration) was
added to stop the reaction. Cells were collected at 500g for 10 min at
4 °C (subsequent work was performed on ice and used cooled buffers
and solutions unless otherwise specified) and washed once with ice-cold
PBS supplemented with 1 mM phenyl methyl sulfonyl fluoride (PMSF).
After centrifugation, cells were lysed in sonication buffer (10 mM
Tris-HCl pH 8.0, 1 mM EDTA pH 8.0, 0.25% SDS, 1× protease inhibitors
(Sigma-Aldrich) and 1 mM PMSF) and sonicated (Covaris S220) for
30 min in a microTUBE until the size of most fragments was in the range
of 200 to 700 bp. Following sonication, the lysate was adjusted to RIPA
buffer conditions (final concentration: 10 mM Tris-HCl pH 8.0, 1 mM
EDTA pH 8.0, 140 mM NaCl, 1% Triton X-100, 0.1% SDS, 0.1% sodium
deoxycholate, 1× protease inhibitors (Sigma-Aldrich) and 1 mM PMSF).

For each immunoprecipitation, 10 µl magnetic Protein A beads (Life
Technologies) were washed twice and resuspended in PBS supplemented
with 0.1% BSA. One microgram of antibody recognizing H3K4me2
(Sigma-Aldrich, clone AW30) was added and bound to the beads by
rotating overnight at 4 °C. Beads were added to the sonicated lysate
and incubated for 2 h at 4 °C on a rotator followed by washing the beads
once with RIPA low-salt buffer (10 mM Tris-HCl, 150 mM NaCl, 0.1% SDS,
0.1% sodium deoxycholate, 1% Triton X-100 and 1 mM EDTA), once with
RIPA high-salt buffer (10 mM Tris-HCl, 500 mM NaCl, 0.1% SDS, 0.1%
sodium deoxycholate, 1% Triton X-100 and 1 mM EDTA), once with RIPA
lithium-chloride buffer (10 mM Tris-HCl, 250 mM LiCl, 0.5% IGEPAL
CA-630, 0.5% sodium deoxycholate and 1 mM EDTA) and once with
Tris-Cl pH 8. Bead-bound chromatin was then resuspended in tagmentation
mix (5 µl 5× TD buffer, 1 µl TDE1 (Illumina), 19 µl nuclease-free
water) and incubated for 10 min at 37 °C.

After tagmentation, the beads were washed once with RIPA and once
with cold Tris-Cl pH 8. Bead-bound tagmented chromatin was resuspended
in 10.5 µl 20 mM EDTA and incubated for 30 min at 50 °C. Then,
10.5 µl 20 mM MgCl₂ as well as 25 µl 2× KAPA HiFi HotStart Ready Mix
(Kapa Biosystems), pre-activated by incubation at 98 °C for 45 s, were
added and incubated for 5 min at 72 °C, followed by incubation for
10 min at 95 °C. Beads were magnetized and 2 µl of each library were
amplified in a 10 µl qPCR reaction containing 0.8 µM primers, SYBR
Green and 5 µl Kapa HiFi HotStart ReadyMix to estimate the optimum
number of enrichment cycles, using the following program: 72 °C for
5 min, 98 °C for 30 s, 24 cycles of 98 °C for 10 s, 63 °C for 30 s, 72 °C for
30 s, and a final elongation at 72 °C for 1 min. Kapa HiFi HotStart
ReadyMix was incubated at 98 °C for 45 s before preparation of all
PCR reactions (qPCR and final enrichment PCR), in order to activate
the hot-start enzyme for successful nick translation at 72 °C in the first
PCR step.

Final enrichment of the libraries was performed in a 50 µl
reaction using 0.75 mM primers (custom Nextera primers as described
for ATAC-seq) and 25 µl Kapa HiFi HotStart ReadyMix. Libraries were
amplified for the number of cycles corresponding to the Cq value
determined in the qPCR reaction. Enriched libraries were purified using
SPRI beads (Beckman Coulter). To prepare input control samples, 3 µl
of 50 mM MgCl2 was added to 15 µl sonicated lysate (pool of 5 µl of
endothelium, epithelium and fibroblast lysates from the same organ)
to neutralize the EDTA in the SDS lysis buffer; 20 µl of tagmentation
buffer and 1 µl transposase (Illumina) were added, and samples were
incubated at 37 °C for 10 min; chromatin was purified with MinElute
PCR purification kit (Qiagen), and 22.5 µl of the purified transposition
reaction were combined with 25 µl of PCR master mix and 0.75 mM
primers (custom Nextera). Control libraries were amplified as described
above. Libraries were sequenced by the Biomedical Sequencing Facility 
at CeMM using the Illumina HiSeq 3000/4000 platform and the 50-bp 
single-end configuration. Epigenome mapping by ChIPmentation was 
done in two biologically independent experiments, with two excep- 
tions: for endothelium from lymph node and fibroblasts from thymus, 
only one high-quality profile could be obtained. Sequencing statistics 
are provided in Supplementary Table 1. 


Quantification of LCMV viral RNA 

In the LCMV infection experiments, samples were collected at day 8
after infection and snap-frozen in liquid nitrogen. For detection of
LCMV viral RNA, pieces of the organs were homogenized with a
TissueLyser II (Qiagen), RNA was extracted using QIAzol lysis reagent
(Qiagen), and reverse transcription was done using random primers 
and the First Strand cDNA Synthesis Kit (Thermo Fisher Scientific) 
according to the manufacturer’s instructions. The expression of the 
viral gene that encodes the LCMV nucleoprotein was measured with 
qPCR using TaqMan Fast Universal PCR Mastermix (Thermo Fisher 
Scientific) and previously published primers and probes⁴⁰. The qPCR
data were normalized against a reference comprising five established
housekeeping genes (TaqMan Gene Expression Assays, Thermo Fisher
Scientific): Arf1 (Mm01946109_uH), Rpl37 (Mm00782745_s1), Rab1b
(Mm00504046_g1), Ef1a (Mm04259522_g1) and Gapdh (4352339E), to
minimize biases introduced by potential tissue-specific expression of
housekeeping genes⁴¹,⁴². Notably, this qPCR assay estimates infection
levels based on LCMV viral RNA in bulk tissue, and it cannot account for 
organ-specific differences in the relative frequencies of cells that are 
susceptible to LCMV infection. The results should therefore be regarded 
as an indication (rather than as a precise quantitative measure) of the 
degree to which structural cells in different organs are directly and 
indirectly affected by the LCMV infections. 
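
As an illustration of normalizing a target signal against several reference genes, the following is a minimal sketch. The source does not state the formula that was used; the ΔCt approach, the function name and all Ct values below are assumptions chosen for illustration.

```python
from statistics import mean

def relative_expression(ct_target, ct_references):
    """Return 2^-(Ct_target - mean reference Ct).

    Averaging Ct values corresponds to taking the geometric mean of the
    reference genes on the linear scale (assumes ~100% PCR efficiency).
    """
    delta_ct = ct_target - mean(ct_references)
    return 2.0 ** (-delta_ct)

# Hypothetical Ct values: one viral target and five reference genes
reference_cts = [20.0, 21.0, 19.0, 22.0, 18.0]    # mean Ct = 20.0
print(relative_expression(23.0, reference_cts))   # delta Ct = 3
```

Using several reference genes dampens the influence of any single housekeeping gene whose expression happens to vary between tissues.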


RNA-seq data processing and quality control 

The RNA-seq data were processed and quality-controlled using
established bioinformatics software and custom analysis scripts. Specific
emphasis was put on ensuring high purity of the structural cell
transcriptomes, while minimizing the risk and effect of potential
contaminations (for example, as a result of cell-free RNA released by dying cells
or potential impurities during FACS purification of the structural cell
populations).

Raw reads were trimmed using trimmomatic (v.0.32) and aligned
to the mouse reference genome (mm10) using HISAT2 (v.2.1.0). Gene 
expression was quantified by counting primary alignments to exons 
using the function ‘summarizeOverlaps’ from the GenomicAlignments 
package (v.1.6.3) in R (v.3.2.3). Gene annotations were based on the 
Ensembl GENCODE Basic set (genome-build GRCm38.p6). 

To detect and remove potential biases arising from contaminating 
RNA of haematopoietic immune cells among the sorted structural cell 
populations, we screened all transcriptome profiles for gene expression 
signatures associated with eight types of haematopoietic immune cells. 
To that end, we derived gene signatures indicative of B cells, T cells, 
natural killer cells, natural killer T cells, macrophages, monocytes, 
dendritic cells and neutrophils from a recent mouse single-cell
expression atlas”. In addition, a T cell receptor signature was defined by the
genes Cd3d, Cd3e, Cd3g, Lat, Lck, Vav1, Tbx21 and Zap70. For each of 
these gene signatures, aggregated expression values were calculated 
as follows. First, raw read counts were converted to counts per million 
(CPM) and log2-transformed using the function ‘voom’ from the limma
package (v.3.26.9) in R. Second, voom-converted values were normal- 
ized to z-scores, followed by averaging across all genes in each gene 
signature. Samples with detectable contamination by haematopoietic 
immune cells were automatically discarded, and replacement samples 
were generated until three uncontaminated samples were available for 
each cell type in each organ. 
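
The aggregation described above (log2(CPM) conversion, per-gene z-scores, averaging over the genes of a signature) can be sketched as follows. This is a simplified illustration in Python rather than the R code used for the study; the offsets of 0.5 and 1 mirror the log-CPM transformation that voom applies, the use of the sample standard deviation is an assumption, and the counts are made up.

```python
import math
from statistics import mean, stdev

def log2_cpm(count, lib_size):
    """voom-style log2 counts per million with small offsets."""
    return math.log2((count + 0.5) / (lib_size + 1.0) * 1e6)

def zscores(values):
    """Standardize one gene across samples (sample standard deviation)."""
    mu, sd = mean(values), stdev(values)
    return [(v - mu) / sd if sd > 0 else 0.0 for v in values]

def signature_scores(counts_by_gene, lib_sizes, signature):
    """Average per-gene z-scores over all genes of one signature.

    counts_by_gene: {gene: [raw count per sample]}
    lib_sizes:      [total read count per sample]
    """
    per_gene = []
    for gene in signature:
        logged = [log2_cpm(c, n) for c, n in zip(counts_by_gene[gene], lib_sizes)]
        per_gene.append(zscores(logged))
    return [mean(sample) for sample in zip(*per_gene)]

# Toy data: the third sample carries T cell receptor transcripts
counts = {"Cd3e": [0, 1, 50], "Lck": [2, 0, 80]}
lib_sizes = [1_000_000, 1_200_000, 900_000]
scores = signature_scores(counts, lib_sizes, ["Cd3e", "Lck"])
print(scores)  # the third sample has the highest score
```

A sample with an unusually high score for an immune cell signature would be a candidate for exclusion; the threshold for "detectable contamination" is not given in the text and is therefore left out here.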

As an additional precaution, we computationally corrected for 
residual contamination by RNA from haematopoietic immune cells. 
To that end, we regressed out the gene signatures of haematopoietic 
immune cells from the matrix of structural cell transcriptomes using 
the function ‘removeBatchEffect’ from the limma package in R. With 
this procedure, we generated corrected log2-transformed counts per
million (log2(CPM)), which we used in all further analyses unless
otherwise stated. Finally, for the comparisons between individual genes, we
normalized the corrected expression values by gene length, thereby 
generating RPKM values. 
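
To make the correction step concrete, here is a minimal sketch of regressing a single contamination signature out of one gene and of converting log2(CPM) to log2(RPKM). It is a simplified stand-in for limma's ‘removeBatchEffect’, not a reimplementation: it handles only one covariate (the study used several signatures at once), and the function names and numbers are illustrative assumptions.

```python
import math
from statistics import mean

def regress_out(expression, covariate):
    """Remove the linear effect of one covariate from one gene.

    Fits expression ~ intercept + covariate by least squares and subtracts
    the fitted covariate term. The covariate is centred first, so the
    gene's mean expression is preserved.
    """
    x_mean, y_mean = mean(covariate), mean(expression)
    centred = [x - x_mean for x in covariate]
    sxx = sum(c * c for c in centred)
    if sxx == 0:
        return list(expression)  # constant covariate: nothing to remove
    slope = sum(c * (y - y_mean) for c, y in zip(centred, expression)) / sxx
    return [y - slope * c for y, c in zip(expression, centred)]

def log2_rpkm(log2_cpm_value, gene_length_bp):
    """Normalize a log2(CPM) value by gene length in kilobases."""
    return log2_cpm_value - math.log2(gene_length_bp / 1000.0)

# A gene whose expression is fully explained by the contamination score
print(regress_out([1.0, 3.0, 5.0, 7.0], [0.0, 1.0, 2.0, 3.0]))  # flat at the mean
print(log2_rpkm(5.0, 2000))  # a 2-kb gene loses one log2 unit
```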


Bioinformatic analysis of cell-type-specific and organ-specific 
gene expression 

The processed and quality-controlled RNA-seq profiles of structural 
cells were analysed for characteristic differences in gene expression 
across cell types and organs, and enrichment analysis against public 
reference data was used to obtain an initial annotation of the identified 
gene signatures (Fig. 1d, Extended Data Fig. 3). 

Cell-type-specific and organ-specific marker genes were identified
using a two-step procedure. First, we performed pairwise differential
expression analysis for each of the three types of structural
cells, comparing them across organs. Second, based on the resulting 
pairwise comparisons, we identified for each organ those genes that 
were upregulated in each cell type compared to at least five other 
organs. Pairwise comparisons between organs were performed using 
the limma package in R, separately for each of the three structural cell 
types. Significantly differential genes were selected based on
statistical significance (adjusted P < 0.05), average expression (log2(CPM)
> −1) and sequencing coverage (median number of reads greater than
10 in the group with stronger signal). On the basis of these pairwise 
comparisons, we counted the total number of times each gene was 
upregulated in a specific organ compared to all other organs. Genes 
that were upregulated in comparison to five or more other organs were 
selected as marker genes of the corresponding organ. 
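
The counting step can be sketched as follows. The input format (one triple per significant pairwise result) and the gene and organ names are assumptions made for this illustration; the threshold of five organs is taken from the text.

```python
from collections import Counter

def marker_genes(pairwise_up, min_organs=5):
    """Select marker genes from pairwise comparisons.

    pairwise_up: iterable of (gene, organ, other_organ) triples, one for
    each pairwise comparison in which `gene` passed all filters and was
    higher in `organ` than in `other_organ`.
    Returns {organ: set of marker genes}.
    """
    wins = Counter((gene, organ) for gene, organ, _ in set(pairwise_up))
    markers = {}
    for (gene, organ), n in wins.items():
        if n >= min_organs:
            markers.setdefault(organ, set()).add(gene)
    return markers

# Hypothetical results for one cell type
others = ["lung", "heart", "skin", "spleen", "kidney"]
pairwise_up = [("Alb", "liver", o) for o in others]
pairwise_up += [("Actb", "liver", "lung"), ("Actb", "liver", "heart")]
print(marker_genes(pairwise_up))  # only Alb reaches five organs
```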

The identified cell-type-specific and organ-specific marker genes 
were subjected to enrichment analysis with Enrichr
(http://amp.pharm.mssm.edu/Enrichr/), using three Enrichr libraries:
KEGG_2019_Mouse, GO_Biological_Process_2018 and Human_Gene_Atlas.
The enrichment analysis provided an initial biological annotation of the
identified marker genes (Fig. 1d) and validated their enrichment for previously
reported cell-type-specific and organ-specific marker genes expressed
in structural cells (Extended Data Fig. 3a). For further confirmation in an
independent reference dataset”, we normalized the expression values
for each gene within each organ in our dataset, followed by averaging 
across genes for each set of marker genes from the reference dataset 
(Extended Data Fig. 3b). The resulting structural cell signatures thus 
reflect gene expression in one cell type compared to the other cell 
types in the same organ. Finally, we normalized these structural cell 
gene signatures across organs and cell types to z-scores, resulting in 
cell-type-specific and organ-specific gene signatures. 


Bioinformatic inference of cell-cell interactions 

To dissect the cell-type-specific and organ-specific crosstalk of struc- 
tural cells with haematopoietic immune cells, we quantified the enrich- 
ment of known receptor-ligand pairs among the identified marker 
genes (Fig. 2a). Moreover, we assessed the expression of genes encoding 
receptors and ligands (Fig. 2b) and aggregated receptor and ligand gene 
signatures (Fig. 2c). Given the scale of our study, it was not possible to 
validate the inferred cell-cell interactions experimentally; however, the 
identification of many well-established interactions and of biologically 
plausible differences between cell types and organs provides support 
for these predictions. 

Cell-type-specific and organ-specific marker genes of structural cells 
were obtained from the RNA-seq data as described in the previous sec- 
tion. Marker genes of haematopoietic immune cells were downloaded 
from supplementary table 4 of a recent mouse single-cell expression 
atlas; from supplementary table 2 of the Tabula Muris paper”°, from 
dataset_S02 (sheets 8, 12, 14 and 16) of a paper by the ImmGen
consortium⁴³, and from supplementary table 1 of a large-scale characterization
of tissue-resident macrophages⁴⁴. To aggregate immune cell genes into
a consensus set, the union of all identified marker genes was taken.

Receptor-ligand pairs were downloaded from CellPhoneDB⁴⁵ (as of
21 September 2018) and merged with additional receptor-ligand pairs
extracted from supplementary data 2 of a recent paper⁴⁶, retaining
only literature-supported pairs. Mouse gene identifiers were mapped
to human gene identifiers based on the NCBI HomoloGene mapping
(build 68) using only the unique mappings. Interactions formed by
receptor complexes from CellPhoneDB were transformed to pairwise 
interactions between individual receptors and ligands, by including 
an interaction between each member of a receptor complex and the 
respective ligand. 
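
The expansion of receptor complexes into pairwise interactions can be sketched as follows. The data structure and the example entries are illustrative assumptions and do not reproduce the CellPhoneDB file format.

```python
def expand_complexes(interactions):
    """Turn complex-level interactions into pairwise receptor-ligand pairs.

    interactions: list of (receptor_subunits, ligand), where
    receptor_subunits is a tuple of gene symbols (length 1 for a receptor
    that is not a complex).
    Returns a sorted list of unique (receptor, ligand) pairs.
    """
    pairs = {(subunit, ligand)
             for subunits, ligand in interactions
             for subunit in subunits}
    return sorted(pairs)

# Illustrative entries: one three-subunit complex and one simple receptor
interactions = [(("IL2RA", "IL2RB", "IL2RG"), "IL2"), (("KIT",), "KITLG")]
for pair in expand_complexes(interactions):
    print(pair)
```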

On the basis of these lists of marker genes and receptor-ligand pairs, 
we inferred potential cell-cell interactions for all pairs of one structural 
cell type and one haematopoietic immune cell type, quantifying the 
enrichment for known receptor-ligand pairs among all pairs of marker
genes between the structural cell type and the haematopoietic cell 
type (Fig. 2a). First, we counted all pairs of marker genes for each pair 
of structural and immune cell types. Second, we calculated the frac- 
tion of these gene pairs that were annotated as receptor-ligand pairs. 
Third, we tested whether this fraction was greater than the fraction 
of annotated receptor-ligand pairs across all pairs of genes. Fisher’s 
exact test was used to obtain P values and odds ratios, as implemented
by the function ‘fisher.test’ in R. P values were adjusted for multiple
testing using the Benjamini-Hochberg method, as implemented in the
function ‘p.adjust’ in R. Finally, significantly enriched pairs (adjusted
P < 0.05) of structural and immune cell types were connected by edges
to generate a graph of cell-cell interactions. 
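
The two statistical steps named above can be illustrated with a small pure-Python sketch. It is not the R code used in the study: the test is written as a one-sided test for enrichment, which is an assumption (the text does not state which alternative was passed to ‘fisher.test’), and the odds ratio is omitted because ‘fisher.test’ reports a conditional estimate that this sketch does not reproduce.

```python
from math import comb

def fisher_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    a: marker-gene pairs annotated as receptor-ligand pairs
    b: marker-gene pairs without such an annotation
    c: remaining gene pairs with an annotation
    d: remaining gene pairs without an annotation
    Returns P(X >= a) under the hypergeometric null distribution.
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    tail = sum(comb(col1, k) * comb(total - col1, row1 - k)
               for k in range(a, upper + 1))
    return tail / comb(total, row1)

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted, running_min = [0.0] * m, 1.0
    for rank_from_top, i in enumerate(order):
        rank = m - rank_from_top  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

print(fisher_greater(3, 1, 1, 3))                    # 17/70
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.5]))
```

Pairs of cell types whose adjusted P value falls below 0.05 would then become edges of the interaction graph.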

In addition to testing for significant enrichment (as described in 
the previous paragraph), we calculated aggregated receptor and 
ligand gene signatures (Fig. 2c) based on a manually curated list of 
immune-related receptors and ligands (Supplementary Table 4). These 
gene signatures were derived by normalizing expression values for 
each gene and averaging across all genes in a given set of biologically 
related receptors and ligands. 


Bioinformatic analysis of the Tabula Muris dataset 

To assess the expression patterns of immune genes in structural cells in 
an independent dataset (Extended Data Fig. 5), we obtained single-cell 
RNA-seq data and t-SNE projections from the Tabula Muris website
https://tabula-muris.ds.czbiohub.org/ (21 December 2018) and cell 
counts from supplementary table 1 of the corresponding paper”. Cell 
types were assigned based on the provided cell ontology class, with 
stroma cells and basal cells of epidermis both labelled as fibroblasts. 
Gene expression counts were converted to transcripts per million (TPM) 
values and log-transformed. Receptor and ligand gene signatures in 
the Tabula Muris data were based on immune-related receptors and 
ligands (Supplementary Table 4) and normalized log(TPM) values 
for each gene, which were averaged across all genes in a given set of 
biologically related receptors and ligands. 


ATAC-seq and ChIPmentation data processing and quality 
control 

The ATAC-seq and H3K4me2 ChIPmentation data were processed using
well-established bioinformatics pipelines, followed by quality control
using the same approach as for the RNA-seq data (described above). The
two data types were processed separately and subsequently integrated 
as described in the next section. 

Raw reads were trimmed with trimmomatic (v.0.32) and aligned to the 
mouse reference genome (mm10) using bowtie2 (v.2.2.4). Primary align- 
ments with mapping quality greater than 30 were retained. ATAC-seq 
peaks were called using MACS (v.2.7.6) on each individual sample. 
H3K4me2 ChIPmentation peaks were called using MACS against the 
input controls (which were obtained by pooling input control data from 
the three types of structural cells in equal amounts for each organ). 
Peaks were aggregated into a list of consensus peaks using the function
‘reduce’ of the package GenomicRanges (version 1.22.4) in R. Consensus
peaks that overlapped with blacklisted genomic regions (downloaded
from http://mitra.stanford.edu/kundaje/akundaje/release/blacklists/ 
mm10-mouse/) were discarded. 
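
The construction of consensus peaks can be illustrated with a short sketch that merges overlapping or directly adjacent intervals and then drops those that overlap a blacklist. This is a simplified Python analogue of the ‘reduce’ operation, using half-open coordinates; the coordinates and helper names are assumptions made for illustration.

```python
def reduce_intervals(intervals):
    """Merge overlapping or adjacent half-open intervals (chrom, start, end)."""
    merged = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(m) for m in merged]

def drop_blacklisted(peaks, blacklist):
    """Discard peaks that overlap any blacklisted interval (simple scan)."""
    def overlaps(p, b):
        return p[0] == b[0] and p[1] < b[2] and b[1] < p[2]
    return [p for p in peaks if not any(overlaps(p, b) for b in blacklist)]

# Hypothetical peaks pooled from several samples
peaks = [("chr1", 100, 200), ("chr1", 150, 300), ("chr1", 300, 350),
         ("chr2", 10, 20), ("chr1", 500, 600)]
consensus = reduce_intervals(peaks)
print(consensus)
print(drop_blacklisted(consensus, [("chr1", 550, 560)]))
```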

Quantitative measurements were obtained by counting reads within 
consensus peaks using the function ‘summarizeOverlaps’ from the 
GenomicAlignments (v.1.6.3) package in R. Samples with detectable 
contamination by haematopoietic immune cells were identified in the 
same way as for the RNA-seq data, using epigenomic signals in promoter 
peaks to calculate immune cell signatures. Contaminated samples 
were automatically removed and replaced by new samples. Finally, we
computationally corrected for residual contamination in the retained 
samples by regressing out epigenomic signatures of haematopoietic 
immune cells from the matrix of signal intensity values across peaks, 
using the function ‘removeBatchEffect’ from the limma package in R. 


Bioinformatic analysis of cell-type-specific and organ-specific
transcription regulation

The processed and quality-controlled RNA-seq, ATAC-seq and H3K4me2
ChIPmentation data were analysed for differences in transcription
regulation across cell types and organs (Fig. 3a-c). We integrated all 
three data types to derive sets of genomic regions with characteristic 
activity patterns (marker peaks), we performed motif enrichment 
analysis for marker peaks, and we inferred a gene-regulatory network 
by connecting enriched transcriptional regulators to their target genes. 
For the inference of gene-regulatory networks, we used DNA sequence 
motifs to connect transcriptional regulators to regulatory regions, and 
genomic proximity to connect regulatory regions to target genes. To 
make these coarse approximations more robust and interpretable, we 
used the ATAC-seq data to exclude motifs that are most likely inactive 
in the investigated cell types. 

First, cell-type-specific and organ-specific marker peaks were 
identified in analogy to the RNA-seq-based identification of marker 
genes (described above), separately for the ATAC-seq and H3K4me2
ChIPmentation data. For each of the three cell types, we performed
pairwise comparisons between organs using the limma package in
R. Differential peaks were selected based on statistical significance
(adjusted P < 0.05), signal intensity (log2(CPM) > −1) and sequencing
coverage (median number of reads of at least 10 in the group with
stronger signal). We then counted, separately for the two data types,
the total number of times each peak showed increased signal in a
specific organ compared with all other organs. Peaks that were upregulated
compared to five or more other organs were selected as marker peaks 
of the corresponding organ. 

Second, we derived a gene-linked list of marker peaks that were 
marked as promoter-associated peaks, enhancer-associated peaks 
and other peaks based on gene annotations as well as the RNA-seq 
and H3K4me2 ChIPmentation data (for enhancers). To that end, the
ATAC-seq peaks were linked to all genes in which the transcription start
site was located within 5 kb, using the function ‘annotatePeakInBatch’ of
the ChIPpeakAnno package (v.3.4.6) in R. They were annotated
as promoter-associated if they were located within 200 bp from a
transcription start site of the respective gene; or as enhancer-associated if
they were located within 1-5 kb from a transcription start site, showed
a correlation of ATAC-seq and RNA-seq signal greater than 0.3 and
overlapped with an H3K4me2 ChIPmentation marker peak.
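
The annotation rules can be written down as a small decision function. The thresholds come from the text; how the distance to the transcription start site is measured (for example, from the peak centre) and the treatment of boundary values are assumptions of this sketch.

```python
def classify_peak(distance_to_tss, atac_rna_correlation, overlaps_h3k4me2_marker):
    """Label a gene-linked ATAC-seq peak as promoter, enhancer or other.

    distance_to_tss: absolute distance in bp between peak and TSS
    atac_rna_correlation: correlation of ATAC-seq and RNA-seq signal
    overlaps_h3k4me2_marker: True if the peak overlaps an H3K4me2 marker peak
    """
    if distance_to_tss <= 200:
        return "promoter"
    if (1000 <= distance_to_tss <= 5000
            and atac_rna_correlation > 0.3
            and overlaps_h3k4me2_marker):
        return "enhancer"
    return "other"

print(classify_peak(150, 0.0, False))   # promoter
print(classify_peak(3000, 0.5, True))   # enhancer
print(classify_peak(3000, 0.2, True))   # other: correlation too low
```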

Third, we performed motif analysis on the gene-linked list of marker 
peaks to connect regulatory regions to transcriptional regulators. The 
HOMER software tool (v.4.9.1) was used to identify regulator binding 
motifs in the marker peaks and to determine regulator enrichment. 
This enrichment analysis was performed separately for all peaks, 
for promoter-associated peaks and for enhancer-associated peaks 
using the HOMER function ‘findMotifsGenome’ with a background set
consisting of all gene-linked peaks. To avoid biases due to differences in
peak size, all regions were standardized to 500 bp around the centre of
the peak. Transcriptional regulators were labelled as significant based
on statistical significance (adjusted P < 0.001) and the strength of the
enrichment (log2(odds ratio) > 1.5).
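
Two small pieces of this step lend themselves to a sketch: resizing every region to a fixed 500-bp window around its centre, and applying the two significance thresholds. The function names and the clipping at coordinate zero are assumptions for illustration; the motif search itself was done with HOMER and is not reproduced here.

```python
def standardize_region(start, end, width=500):
    """Return a fixed-width window centred on the midpoint of [start, end)."""
    centre = (start + end) // 2
    new_start = max(0, centre - width // 2)  # avoid negative coordinates
    return new_start, new_start + width

def is_significant_regulator(adjusted_p, log2_odds_ratio):
    """Apply the thresholds stated in the text."""
    return adjusted_p < 0.001 and log2_odds_ratio > 1.5

print(standardize_region(1000, 1200))        # window around position 1100
print(standardize_region(0, 100))            # clipped at the chromosome start
print(is_significant_regulator(1e-4, 2.0))
```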

Fourth, we inferred a gene-regulatory network by connecting tran- 
scriptional regulators with their target genes. This analysis included 
only those regulators that were enriched in the motif analysis of 
gene-linked marker peaks (described in the previous paragraph). No 
regulators were excluded based on low or undetectable RNA expression 
levels, given that transcription factors can play a relevant regulatory 
role despite low expression levels. The gene-regulatory network was 
constructed as follows: (i) transcriptional regulators were linked to 
their target peaks using the HOMER function ‘annotatePeaks’; (ii) the
list of gene-linked marker peaks was used to map these peaks to their
target genes; (iii) the final network was constructed based on the links
between regulators and target peaks, and between peaks and target
genes, using the package igraph (v.1.1.2) in R.
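
Steps (i) to (iii) amount to joining two link tables. A minimal sketch with plain dictionaries is shown below; the regulator, peak and gene names are hypothetical, and the study itself built the network with igraph in R.

```python
def build_network(regulator_to_peaks, peak_to_genes):
    """Join regulator->peak and peak->gene links into regulator->gene edges.

    Returns a sorted list of (regulator, gene, supporting_peak) triples.
    Peaks without a linked gene contribute no edge.
    """
    edges = set()
    for regulator, peaks in regulator_to_peaks.items():
        for peak in peaks:
            for gene in peak_to_genes.get(peak, ()):
                edges.add((regulator, gene, peak))
    return sorted(edges)

# Hypothetical links
regulator_to_peaks = {"Stat1": ["peak1", "peak2"], "Irf1": ["peak2", "peak3"]}
peak_to_genes = {"peak1": ["Cxcl10"], "peak2": ["Cxcl9", "Cxcl10"]}
for edge in build_network(regulator_to_peaks, peak_to_genes):
    print(edge)
```

Keeping the supporting peak on each edge makes it possible to trace every regulator-gene link back to the regulatory region that motivated it.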

Fifth, we analysed the inferred gene-regulatory network by
determining the network similarity between all pairs of transcriptional regulators
using the function ‘similarity’ from the igraph package in R (with inverse
log-weighted similarity). Similarities among transcriptional
regulators were visualized by multidimensional scaling using the function
‘cmdscale’ in R, based on similarity measures normalized to a minimum
of zero and maximum of one, and converted to distance measures by 
taking one minus the normalized similarity value. 
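
The similarity measure and its conversion to a distance can be sketched as follows. The inverse log-weighted similarity of two vertices sums 1/log(degree) over their common neighbours; the toy network, the treatment of the graph as undirected and the helper names are assumptions of this illustration, and the multidimensional scaling step is omitted.

```python
import math
from itertools import combinations

def inverse_log_weighted(neighbours, u, v):
    """Sum 1/log(degree) over the common neighbours of u and v."""
    common = neighbours[u] & neighbours[v]
    return sum(1.0 / math.log(len(neighbours[w])) for w in common)

def similarity_to_distance(similarities):
    """Min-max normalize similarities to [0, 1] and return 1 - value."""
    lo, hi = min(similarities.values()), max(similarities.values())
    span = hi - lo
    return {pair: 1.0 - ((s - lo) / span if span > 0 else 0.0)
            for pair, s in similarities.items()}

# Toy undirected network: regulators A, B, C and target genes g1, g2, g3
neighbours = {
    "A": {"g1", "g2"}, "B": {"g1", "g2"}, "C": {"g2", "g3"},
    "g1": {"A", "B"}, "g2": {"A", "B", "C"}, "g3": {"C"},
}
sims = {(u, v): inverse_log_weighted(neighbours, u, v)
        for u, v in combinations(["A", "B", "C"], 2)}
print(similarity_to_distance(sims))  # A and B share the most targets
```

Targets shared by few regulators contribute more to the similarity than targets bound by many, which is the effect of the inverse logarithmic weighting.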


Epigenetic potential for immune gene activation

To identify genes with unrealized epigenetic potential (that is, genes
in which the chromatin state indicates much higher expression than 
observed under homeostatic conditions), we compared chromatin 
accessibility (ATAC-seq signal) of promoter regions with the expression 
levels of corresponding genes (RNA-seq signal) (Fig. 3d-g). 

First, RNA-seq and ATAC-seq read counts were combined into a single
table, converted to log2(CPM) values and quantile-normalized using
the function ‘voom’ of the limma package in R. Second, we removed
systematic differences between the two assays for each gene using
the function ‘removeBatchEffect’ of the limma package in R. Third,
we compared normalized RNA-seq and ATAC-seq signal intensities
for each gene, organ and cell type using differential analysis from
the limma package in R. This analysis used biological replicates for
RNA-seq and ATAC-seq to statistically assess for a given combination
of organ and cell type whether the ATAC-seq signal was
disproportionately greater (compared to other organs and cell types) than the
RNA-seq signal. On the basis of this analysis, genes were selected as
significant based on the observed difference (log2-transformed fold
change > 0.7), statistical significance (adjusted P < 0.05) and mean
signal intensity (normalized log2(CPM) > 0). For ATAC-seq, we further
required a minimum sequencing coverage (median number of reads
> 5). In addition, we calculated the enrichment of genes with immune
functions among genes with unrealized epigenetic potential. To that
end, we selected the 200 genes with the greatest difference between
ATAC-seq and RNA-seq signal in each organ and cell type, and we
calculated their enrichment for our manually curated immune-related
gene sets (Supplementary Table 8) using Fisher’s exact test based on 
the function ‘fisher.test’ in R. 


Bioinformatic analysis of the LCMV infection experiments 

To evaluate the epigenetic potential for immune gene activation in 
structural cells, we compared the RNA-seq profiles of structural cells 
under homeostatic conditions with those collected on day 8 after infec- 
tion with LCMV, and we assessed the predictive ability of the unrealized 
potential for the LCMV-induced changes in gene expression (Fig. 4). 
This analysis was based on the hypothesis that LCMV infection, which 
is widely regarded as a systemic infection model, will affect structural
cells in most or all investigated organs, which may occur through direct 
effects (structural cells get infected) as well as indirect effects (struc- 
tural cells respond to aspects of tissue-specific or systemic infection). 
Our gene expression analysis likely reflects a superposition of several 
such effects. 

First, we identified the genes that were upregulated upon LCMV 
infection. To that end, we compared samples collected under homeo- 
static conditions with those collected after LCMV infection, separately 
for each cell type and organ, using the limma package in R. Differen- 
tially expressed genes were selected based on statistical significance 
(adjusted P < 0.05), average expression (log2(CPM) > 1) and sequencing
coverage (median number of reads > 20 in the upregulated condition). 
Enrichment analysis was performed by comparing differential genes 
to our manually curated immune-related gene sets (Supplementary 
Table 8) using Fisher’s exact test with the function ‘fisher.test’ in R. 

Second, we evaluated whether genes with unrealized epigenetic 
potential under homeostatic conditions were preferentially upregu- 
lated in response to LCMV infection. For each cell type and organ, we 
split all genes into two groups based on their log2-transformed fold
change comparing ATAC-seq signal to RNA-seq signal: those with
unrealized potential (ATAC-seq signal greater than RNA-seq signal)
and those without unrealized potential (RNA-seq signal greater than
ATAC-seq signal). We then assessed whether the LCMV-induced changes
in gene expression (log2-transformed fold change comparing samples
collected after LCMV infection to homeostatic conditions) were able to
discriminate between the two groups defined by epigenetic potential.
For visualization and quantitative comparison across cell types and
organs, the data were plotted in diagrams that conceptually resemble
receiver operating characteristic curves, and the predictive ability of
the unrealized potential for gene activation after LCMV infection was
quantified by area-under-the-curve values. 
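
An area-under-the-curve value of this kind can be computed directly from the two groups of fold changes, because it equals the probability that a randomly chosen gene from one group has a higher value than a randomly chosen gene from the other. The sketch below uses this pairwise formulation; the fold changes are invented, and the exact construction of the curves in the paper may differ.

```python
def auc(scores_positive, scores_negative):
    """Area under the ROC curve via pairwise comparison (ties count 0.5).

    scores_positive: LCMV-induced log2 fold changes of genes with
                     unrealized potential
    scores_negative: fold changes of genes without unrealized potential
    """
    wins = 0.0
    for p in scores_positive:
        for n in scores_negative:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_positive) * len(scores_negative))

# Hypothetical log2 fold changes after infection
with_potential = [2.0, 1.5, 0.2]
without_potential = [0.1, -0.3, 0.2, 1.0]
print(auc(with_potential, without_potential))  # 0.875
```

A value of 0.5 would indicate no predictive ability, and a value of 1 perfect separation of the two groups.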

Third, the activated potential was calculated for each cell type and 
organ as the percentage of genes with unrealized potential under 
homeostatic conditions (ATAC-seq signal greater than RNA-seq signal, 
P < 0.05) that were among the significantly upregulated genes after
LCMV infection (defined as described above).


Bioinformatic analysis of the cytokine treatment experiments

To dissect the in vivo effects of individual cytokines on structural cells
(Fig. 5), we analysed the transcriptional response to the treatment of
mice with six cytokines administered as recombinant proteins, and
we obtained RNA-seq profiles (processed and quality-controlled as 
described above) for analysis of gene expression. 

First, we compared the gene expression of structural cells between 
cytokine-treated mice and mock-treated (PBS) controls using the 
limma package in R, separately for each cell type and organ. Differen- 
tially expressed genes were selected based on statistical significance 
(adjusted P < 0.05) and average expression (log2(CPM) > 2).

Second, we compared the changes in gene expression upon cytokine 
treatment to the changes in gene expression upon LCMV infection. 
To that end, we correlated cytokine-related log-transformed fold 
changes to LCMV-related log-transformed fold changes across all genes, 
separately for each cell type, organ and cytokine. 
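
The comparison of fold changes reduces to one correlation coefficient per cell type, organ and cytokine. The text does not say which coefficient was used; the sketch below assumes Pearson correlation and uses invented values.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical log fold changes for the same genes in both experiments
cytokine_lfc = [1.2, 0.1, -0.4, 2.0, 0.0]
lcmv_lfc = [1.0, 0.3, -0.2, 1.7, 0.1]
print(pearson(cytokine_lfc, lcmv_lfc))
```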

Third, we tested whether genes upregulated after cytokine treatment 
were enriched for genes with unrealized epigenetic potential. To that 
end, we calculated the enrichment of genes with unrealized epigenetic 
potential (defined as described above) among genes upregulated by 
cytokine treatment (adjusted P< 0.05, log-transformed fold change 
> 0), separately for each organ, cell type and cytokine. Fisher’s exact 
test was used to calculate statistical significance with the function 
‘fisher.test’ in R. 


Reporting summary 
Further information on research design is available in the Nature 
Research Reporting Summary linked to this paper. 


Data availability 


Raw and processed sequencing data (RNA-seq, ATAC-seq and H3K4me2 
ChIPmentation) are available from the NCBI Gene Expression Omnibus
(GEO) repository under accession number GSE134663. In addition, the
dataset is provided as an online resource on a supplementary website
(http://structural-immunity.computational-epigenetics.org), which 
includes links to raw and processed sequencing data, further analy- 
sis results and genome browser tracks for interactive visualization of 
the RNA-seq, ATAC-seq and ChIPmentation profiles. Source data are 
provided with this paper. 


Code availability 


The analysis source code underlying the final version of the paper 
is openly available from http://structural-immunity.computational- 
epigenetics.org. 


32. Ahmed, R., Salmi, A., Butler, L. D., Chiller, J. M. & Oldstone, M. B. Selection of genetic
variants of lymphocytic choriomeningitis virus in spleens of persistently infected mice.
Role in suppression of cytotoxic T lymphocyte response and viral persistence. J. Exp.
Med. 160, 521-540 (1984).

33. Bergthaler, A. et al. Viral replicative capacity is the primary determinant of lymphocytic
choriomeningitis virus persistence and immunosuppression. Proc. Natl Acad. Sci. USA
107, 21641-21646 (2010).

34. ImmGen Consortium. Final ImmGen sorting SOP;
http://www.immgen.org/Protocols/ImmGen%20Cell%20prep%20and%20sorting%20SOP.pdf
(accessed 24 March 2020).

35. Krausgruber, T. et al. T-bet is a key modulator of IL-23-driven pathogenic CD4+ T cell
responses in the intestine. Nat. Commun. 7, 11627 (2016).

36. Saluzzo, S. et al. First-breath-induced type 2 pathways shape the lung immune
environment. Cell Rep. 18, 1893-1905 (2017).

37. Lercher, A. et al. Type I interferon signaling disrupts the hepatic urea cycle and
alters systemic metabolism to suppress T cell function. Immunity 51, 1074-1087.e9
(2019).

38. Corces, M. R. et al. Lineage-specific and single-cell chromatin accessibility charts human
hematopoiesis and leukemia evolution. Nat. Genet. 48, 1193-1203 (2016).

39. Gustafsson, C., De Paepe, A., Schmidl, C. & Mansson, R. High-throughput ChIPmentation:
freely scalable, single day ChIPseq data generation from very low cell-numbers. BMC
Genomics 20, 59 (2019).

40. Pinschewer, D. D. et al. Innate and adaptive immune control of genetically engineered
live-attenuated arenavirus vaccine prototypes. Int. Immunol. 22, 749-756 (2010).

41. Kouadjo, K. E., Nishida, Y., Cadrin-Girard, J. F., Yoshioka, M. & St-Amand, J. Housekeeping
and tissue-specific genes in mouse tissues. BMC Genomics 8, 127 (2007).

42. Li, B. et al. A comprehensive mouse transcriptomic BodyMap across 17 tissues by
RNA-seq. Sci. Rep. 7, 4200 (2017).

43. Shay, T. et al. Conservation and divergence in the transcriptional programs of the
human and mouse immune systems. Proc. Natl Acad. Sci. USA 110, 2946-2951
(2013).

44. Lavin, Y. et al. Tissue-resident macrophage enhancer landscapes are shaped by the local
microenvironment. Cell 159, 1312-1326 (2014).

45. Vento-Tormo, R. et al. Single-cell reconstruction of the early maternal-fetal interface in
humans. Nature 563, 347-353 (2018).

46. Ramilowski, J. A. et al. A draft network of ligand-receptor-mediated multicellular
signalling in human. Nat. Commun. 6, 7866 (2015).


Acknowledgements We thank the Core Facility Flow Cytometry of the Medical University of 
Vienna for cell sorting service; the Biomedical Sequencing Facility at CeMM for assistance with 
next-generation sequencing; S. Zahalka and S. Knapp for help and advice with the preparation 
of lung samples; S. Niggemeyer, J. Riede, S. Jungwirth and N. Fleischmann for animal 
caretaking; and all members of the Bock laboratory for their help and advice. This work was 
conducted in the context of two Austrian Science Fund (FWF) Special Research Programme 
grants (FWF SFB F6102; FWF SFB F7001). T.K. is supported by a Lise Meitner fellowship from the 
Austrian Science Fund (FWF M2403). N.F. is supported by a fellowship from the European 
Molecular Biology Organization (EMBO ALTF 241-2017). A.L. is supported by a DOC fellowship 
of the Austrian Academy of Sciences. A.B. is supported by an ERC Starting Grant (European 
Union’s Horizon 2020 research and innovation programme, grant agreement no. 677006). C.B.
is supported by a New Frontiers Group award of the Austrian Academy of Sciences and by an 
ERC Starting Grant (European Union’s Horizon 2020 research and innovation programme, 
grant agreement no. 679146). 


Author contributions T.K. designed the project, performed experiments, analysed data and 
cowrote the manuscript. N.F. designed and performed the bioinformatic analysis and cowrote 
the manuscript. V.F.-G., M.S., L.C.S., A.N. and C.S. contributed to sample collection and 
sequencing library preparation. A.L. contributed to the experimental design and performed
in vivo experiments (LCMV infections; cytokine treatments). A.F.R. contributed bioinformatic
software. A.B. contributed to the experimental design and supervised the in vivo experiments. 
C.B. supervised the project and cowrote the manuscript. All authors read, contributed to, and 
approved the final manuscript. 


Competing interests The authors declare no competing interests. 


Additional information 

Supplementary information is available for this paper at
https://doi.org/10.1038/s41586-020-2424-4.

Correspondence and requests for materials should be addressed to C.B. 

Peer review information Nature thanks Ari M. Melnick and the other, anonymous, reviewer(s) 
for their contribution to the peer review of this work. 

Reprints and permissions information is available at http://www.nature.com/reprints. 


[Figure residue: flow cytometry gating plots across organs (labels recovered: caecum, heart, large intestine, skin). Axes: FSC-A, SSC-A, FSC-H, viability dye (PE-TxRed), haematopoietic cells (PerCP-Cy5.5) and Epcam (PE-Cy7). Gates: single cells, non-haematopoietic cells, lineage-positive, lineage-negative, Epcam-negative, epithelium, endothelium and fibroblasts. Bar-chart legend: dissociation protocol (organ-specific or standardized).]


Extended Data Fig. 1 | Standardized identification and purification of 
structural cells across 12 organs. a, Cell-type identification and cell-sorting 
scheme (top row) with representative flow cytometry plots (selected from n = 4 
independent biological replicates) in one representative organ (brain) under 
homeostatic conditions (bottom row). b, Representative plots (selected from 
n = 4 independent biological replicates) for gating steps 4-6 of the 
standardized cell-type identification and cell-sorting scheme (a) across the 12 
organs under homeostatic conditions. c, Relative frequencies of structural cell 
types among non-haematopoietic (CD45−) cells across 12 organs, for cell 
suspensions obtained by standardized organ dissociation. d, Relative 
frequencies of structural cell types among non-haematopoietic (CD45−) cells 
across three organs, for cell suspensions obtained by either standardized 
organ dissociation or organ-specific dissociation protocols. Shown are mean 
and s.e.m. values. Sample size: n = 4 (c) and n = 3 (d) independent biological 
replicates. 
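
The legend reports mean and s.e.m. values over a small number of biological replicates. As a minimal sketch of that summary statistic (the replicate percentages below are made-up illustration values, not data from the study, and the sample standard deviation with n − 1 is an assumption about the estimator):

```python
import math
import statistics


def mean_and_sem(values):
    """Return (mean, standard error of the mean) for a list of replicate values."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two replicates to estimate the s.e.m.")
    mean = statistics.fmean(values)
    # statistics.stdev uses the sample (n - 1) estimator
    sem = statistics.stdev(values) / math.sqrt(n)
    return mean, sem


# Hypothetical percentages of one cell type among CD45- cells, n = 4 replicates
replicates = [40.0, 44.0, 38.0, 42.0]
mean, sem = mean_and_sem(replicates)
print(f"{mean:.1f} +/- {sem:.2f} %")
```

With only three or four replicates the s.e.m. is itself a rough estimate, which is one reason to show individual data points alongside it.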
Extended Data Fig. 2 | Surface marker profiling of structural cells under 
homeostatic conditions. a, Gating strategy for the flow cytometry-based 
validation of the structural cell sorting scheme. Identification of structural 
cells starts with gating for intact cells (1), single cells (2), live cells (3) and 
non-haematopoietic cells (4). From the resulting non-haematopoietic (CD45−) 
cell population, potential epithelial cells (5.1) are gated for epithelial cell 
markers (5.2). Similarly, potential endothelial cells and fibroblasts (6.1, 6.2) are 
gated for endothelial cell markers (6.3) and fibroblast markers (6.4). b, Relative 
frequencies of potential structural cell types based on gates 5.2, 6.3 and 
6.4 (from a), comparing the selected markers with alternative markers. 
c, Expression of the selected surface markers of structural cell types (top row) 
and potential alternative markers for cells gated as in Extended Data Fig. 1a. 
Shown are mean and s.e.m. values. Sample size (all panels): n = 3 independent 
biological replicates. 
Extended Data Fig. 3 | Comparison of the structural cell transcriptomes to 
published reference data. a, Overlap of the identified cell-type-specific and 
organ-specific marker genes (derived from the RNA-seq experiments in the 
current study) with tissue-specific gene sets from a microarray-based 
expression atlas (two-sided Fisher’s exact test with multiple-testing 
correction). b, Gene expression across cell types and organs (from the current 
study) aggregated across marker genes of structural cell clusters in a single-cell 
RNA-seq atlas of the mouse. c, Gene expression across cell types and organs 
(from the current study) plotted for a manually curated list of commonly used 
markers of structural cells. d, Hierarchical clustering of structural cells across 
cell types and organs based on the transcriptome profiles from the current 
study. Sample size (all panels): n = 3 independent biological replicates. 
Extended Data Fig. 4 | Inference of cell-cell interactions across cell types 
and organs. a, Enrichment analysis for potential cell-cell interactions 
between structural cells and haematopoietic immune cells, based on gene 
expression of known receptor-ligand pairs (two-sided Fisher’s exact test with 
multiple-testing correction). For each combination of one structural cell type 
and one haematopoietic immune cell type, the analysis assesses whether all 
pairs of marker genes between the two cell types are enriched for 
annotated receptor-ligand pairs. b, Differentially expressed genes across 
cell types and organs, based on a manually curated list of receptors and 
ligands (Supplementary Table 4). Sample size (all panels): n = 3 independent 
biological replicates. 
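
The enrichment in panel a rests on a two-sided Fisher’s exact test applied to a 2×2 table. The following is a minimal, standard-library sketch of that test, not the authors' code. The table layout (marker-gene pairs versus background pairs, annotated versus not annotated as receptor-ligand pairs) and the counts are assumptions made for illustration, and the multiple-testing correction is left out because the legend does not name the method.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell with fixed margins
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables that are no more likely than the observed one
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(p, 1.0)


# Hypothetical counts: rows = marker-gene pairs / background pairs,
# columns = annotated receptor-ligand pair / not annotated
p_value = fisher_exact_two_sided(12, 88, 30, 870)
print(f"P = {p_value:.3g}")
```

The small relative tolerance in the comparison guards against floating-point ties between equally probable tables.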
Extended Data Fig. 5 | Analysis of immune gene expression among 
structural cells in an independent dataset. a, Relative frequencies of 
single-cell transcriptomes classified as endothelium, epithelium and 
fibroblasts in selected organs according to the Tabula Muris dataset. 
b, Expression of immune gene signatures in structural cells according to the 
Tabula Muris dataset, jointly normalized across all plots (for comparability). 
c, Expression of immune gene signatures in haematopoietic immune cells 
according to the Tabula Muris dataset, normalized as in b. d, Expression of 
selected immune genes in structural cells and in haematopoietic immune 
cells according to the Tabula Muris dataset. Sample size: n = 7 (all panels) 
independent biological replicates, comprising 4 male and 3 female mice. 
Extended Data Fig. 6 | Analysis of transcription regulation in structural 
cells. a, Multidimensional scaling analysis of the similarity of chromatin 
profiles across cell types, organs and replicates based on ATAC-seq (top) and 
H3K4me2 ChIPmentation (bottom). b, Correlation of chromatin profiles across 
cell types and organs for ATAC-seq (left) and H3K4me2 ChIPmentation (right). 
c, Transcriptional regulators of the inferred gene-regulatory network for 
structural cells, arranged by similarity using multidimensional scaling. d, Motif 
enrichment for transcriptional regulators among differential chromatin peaks, 
shown separately for each regulator (one-sided hypergeometric test with 
multiple-testing correction). e, Gene expression of the transcriptional 
regulators across cell types and organs (genes discussed in the text are in bold). 
Sample size (all panels): n = 2 independent biological replicates. 
Extended Data Fig. 7 | Detection and analysis of genes with unrealized 
epigenetic potential. a, Scatterplot showing the correlation between 
chromatin accessibility in promoter regions and the corresponding gene 
expression levels in structural cells across cell types and organs. Genes with 
significant unrealized epigenetic potential (calculated as the difference 
between normalized ATAC-seq and RNA-seq signals) are highlighted in blue. 
b, Enrichment of immune-related gene sets among the genes with unrealized 
epigenetic potential (two-sided Fisher’s exact test with multiple-testing 
correction). Sample size (all panels): n = 2 independent biological replicates. 
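
The legend defines unrealized epigenetic potential as the difference between normalized ATAC-seq and RNA-seq signals. A minimal sketch of that idea follows. The z-score normalization, the function names and the toy values are assumptions for illustration; the legend does not specify how the signals were normalized or how significance was assessed.

```python
import statistics


def zscores(values):
    """Centre and scale a list of values (population standard deviation)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        raise ValueError("cannot normalize a constant signal")
    return [(v - mean) / sd for v in values]


def unrealized_potential(atac, rna):
    """Per-gene difference between normalized promoter accessibility and expression."""
    if len(atac) != len(rna):
        raise ValueError("ATAC-seq and RNA-seq vectors must cover the same genes")
    return [a - r for a, r in zip(zscores(atac), zscores(rna))]


# Toy example: geneD has an open promoter but low expression
genes = ["geneA", "geneB", "geneC", "geneD"]
atac = [1.0, 2.0, 3.0, 4.0]
rna = [1.0, 2.0, 3.0, 0.0]
scores = unrealized_potential(atac, rna)
top_gene = max(zip(scores, genes))[1]
print(top_gene)
```

A large positive score marks a gene whose promoter is more accessible than its expression level would suggest, which is the pattern highlighted in panel a.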
Extended Data Fig. 8 | Standardized identification and purification of 
structural cells after LCMV infection. a, Cell-type identification and 
cell-sorting scheme (top row) with representative flow cytometry plots 
(selected from n = 3 independent biological replicates) in one representative 
organ (brain) after LCMV infection (bottom row). b, Representative plots 
(selected from n = 3 independent biological replicates) for gating steps 4-6 
of the standardized cell-type identification and cell-sorting scheme (a) across 
the 12 organs after LCMV infection. c, Change in the relative frequency of 
structural cells after LCMV infection. Sample size (all panels): n = 3 independent 
biological replicates. 
Extended Data Fig. 9 | Analysis of differential gene expression in response 
to LCMV infection. a, Number of differentially expressed genes in structural 
cells after LCMV infection (this includes not only immune genes but also genes 
associated with the substantial organ-specific tissue damage and other direct 
and indirect effects of LCMV infection). b, Correlation of the observed changes 
in gene expression after LCMV infection across cell types and organs. 
c, Organ-specific viral load at day 8 of LCMV infection, measured by qPCR in 
whole-tissue samples collected from each organ (without FACS purification of 
individual cell types). Five reference genes were used for normalization and 
results were ranked across organs, to make the analysis robust towards 
tissue-specific differences in the expression of these housekeeping genes. 
However, the experimental results do not support an absolute quantification 
of viral load in each organ nor do they account for differences in the relative 
frequencies of cells that are susceptible to LCMV infection across organs. 
d, Scatterplot illustrating the low correlation between the activated epigenetic 
potential and the measured viral load across cell types and organs. e, f, Network 
analysis (e) and enrichment analysis (f) of potential cell-cell interactions 
between structural cells and haematopoietic immune cells, inferred from gene 
expression of known receptor-ligand pairs after LCMV infection (two-sided 
Fisher’s exact test with multiple-testing correction). For each combination of 
one structural cell type and one haematopoietic immune cell type, the analysis 
assesses whether all pairs of marker genes between the two cell types are 
enriched for annotated receptor-ligand pairs. Sample size (all panels): n = 3 
independent biological replicates. 
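
Panel c normalizes the viral qPCR signal against five reference genes and ranks the result across organs. One way such a rank-based summary could look is sketched below. The delta-Ct normalization, the averaging of ranks across reference genes, and all names and Ct values are assumptions for illustration; the legend does not describe how the five normalizations were combined.

```python
def mean_rank_viral_load(ct_virus, ct_refs):
    """Average rank of each organ's viral load across reference genes.

    ct_virus: {organ: Ct of the viral amplicon}
    ct_refs:  {reference_gene: {organ: Ct of that reference gene}}
    Rank 1 is the highest relative viral load.
    """
    organs = list(ct_virus)
    rank_sums = {organ: 0.0 for organ in organs}
    for ref_cts in ct_refs.values():
        # Delta Ct: a smaller value means more viral RNA relative to this reference
        delta = {organ: ct_virus[organ] - ref_cts[organ] for organ in organs}
        # Stable sort: ties keep the input order of the organs
        ordered = sorted(organs, key=lambda organ: delta[organ])
        for rank, organ in enumerate(ordered, start=1):
            rank_sums[organ] += rank
    return {organ: rank_sums[organ] / len(ct_refs) for organ in organs}


# Hypothetical Ct values for three organs and two reference genes
ct_virus = {"spleen": 20.0, "liver": 22.0, "brain": 30.0}
ct_refs = {
    "refA": {"spleen": 18.0, "liver": 18.0, "brain": 18.0},
    "refB": {"spleen": 17.0, "liver": 20.0, "brain": 19.0},
}
print(mean_rank_viral_load(ct_virus, ct_refs))
```

Working with ranks rather than normalized values means that an organ in which one housekeeping gene is unusually high or low shifts the result by at most a few rank positions, in line with the robustness argument in the legend.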
Extended Data Fig. 10 | Visualization of differential gene expression in 
response to in vivo cytokine treatments. The heat map displays changes in 
the expression of genes associated with immune functions, plotted across cell 
types, organs and cytokines (two-sided linear model with multiple-testing 
correction). Sample size: n = 3 independent biological replicates. 
Extended Data Fig. 11 | Analysis of differential gene expression in response 
to in vivo cytokine treatments. a, Number of differentially expressed genes in 
response to the individual cytokine treatments. b, Gene expression for the 
known receptors involved in the response to the individual cytokine 
treatments, plotted across cell types and organs under homeostatic 
conditions. Sample size (all panels): n = 3 independent biological replicates. 
n/a | Confirmed 


The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement 


A statement on whether measurements were taken from distinct samples or whether the same sample was measured repeatedly 


The statistical test(s) used AND whether they are one- or two-sided 
Only common tests should be described solely by name; describe more complex techniques in the Methods section. 


A description of all covariates tested 


A description of any assumptions or corrections, such as tests of normality and adjustment for multiple comparisons 


A full description of the statistical parameters including central tendency (e.g. means) or other basic estimates (e.g. regression coefficient) 
AND variation (e.g. standard deviation) or associated estimates of uncertainty (e.g. confidence intervals) 


For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted 
Give P values as exact values whenever suitable. 


For Bayesian analysis, information on the choice of priors and Markov chain Monte Carlo settings 


For hierarchical and complex designs, identification of the appropriate level for tests and full reporting of outcomes 


Estimates of effect sizes (e.g. Cohen's d, Pearson's r), indicating how they were calculated 


Our web collection on statistics for biologists contains articles on many of the points above. 


Software and code 


Policy information about availability of computer code 


Data collection Flow cytometry data were collected with BD FACSDiva (version 8.0.1) and sort-purification of cells was performed with Summit (version 
6.3.1.16945; MoFlo Astrios cell sorter) or Sony Cell Sorter (version 2.1.5, SH800 cell sorter) software. For determining optimal enrichment 
cycles during ATAC-seq and ChIPmentation library preparation, qPCR data were collected using Bio-Rad CFX Manager (version 3.1). 


Data analysis The following software packages were used for data analysis: Flow cytometry data were analyzed with FlowJo (version 10.5.3, Tree Star) 
software. Raw sequencing data were processed using trimmomatic (version 0.32), HISAT2 (version 2.1.0), bowtie2 (version 2.2.4), and 
MACS (version 2.7.6). Integrated data analysis was performed in R (version 3.2.3), based on the packages ChIPpeakAnno (3.4.6), 
data.table (1.11.4), DiffBind (1.16.3), doMC (1.3.5), fmsb (0.6.3), foreach (1.4.4), gdata (2.18.0), GenomicRanges (1.22.4), ggplot2 (2.2.1), 
gplots (3.0.1), grid (3.2.3), gridExtra (2.3), igraph (1.1.2), limma (3.26.9), LOLA (1.0.0), Matrix (1.2.14), pheatmap (1.0.10), project.init 
(0.0.1), RColorBrewer (1.1.2), readxl (1.3.1), ROCR (1.0.7), Rsamtools (1.22.0), scales (0.5.0), and WriteXLS (5.0.0). Inkscape (version 0.92) 
was used for data visualization and figure preparation. The analysis source code underlying the final version of the paper is available from 
the Supplementary Website (http://structural-immunity.computational-epigenetics.org). 


For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors/reviewers. 
We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Research guidelines for submitting code & software for further information. 


Data 


Policy information about availability of data 
All manuscripts must include a data availability statement. This statement should provide the following information, where applicable: 


- Accession codes, unique identifiers, or web links for publicly available datasets 
- A list of figures that have associated raw data 
- A description of any restrictions on data availability 


Raw and processed sequencing data (RNA-seq, ATAC-seq, H3K4me2 ChIPmentation) are available from the NCBI Gene Expression Omnibus (GEO) repository 
(accession number: GSE134663). In addition, the data are provided as an online resource on a supplementary website 
(http://structural-immunity.computational-epigenetics.org), which includes links to the raw and processed data, further analysis results, and 
genome browser tracks for interactive visualization of the RNA-seq, ATAC-seq, and ChIPmentation profiles. 
Sample size Sample size was based on established standards in the field, including guidelines and SOPs from the Immgen project. 


Data exclusions To avoid potential biases due to contaminating RNA from hematopoietic immune cells (which may be released from dying cells or as the result 
of potential impurities during FACS purification of the structural cell populations), we screened all transcriptome profiles for gene expression 
signatures associated with hematopoietic immune cells. Samples with detectable contamination were automatically discarded, and 
replacement samples were generated until three uncontaminated samples were available for each cell type in each organ. These exclusion 
criteria were not pre-established; rather, they were devised based on the data for the first sample batches and consistently applied thereafter. 
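
The exclusion rule above amounts to scoring each transcriptome for an immune-cell signature and discarding samples above a cut-off until three clean replicates remain. The sketch below illustrates that logic only. The marker genes, the mean-expression score, the threshold and the expression values are hypothetical and are not the signature or criteria used in the study.

```python
def immune_signature_score(expression, immune_markers):
    """Mean expression of the immune marker genes in one sample (missing genes count as 0)."""
    return sum(expression.get(gene, 0.0) for gene in immune_markers) / len(immune_markers)


def screen_samples(samples, immune_markers, threshold, needed=3):
    """Split samples into clean and discarded, and report whether enough clean ones remain."""
    clean, discarded = [], []
    for name, expression in samples.items():
        score = immune_signature_score(expression, immune_markers)
        (discarded if score > threshold else clean).append(name)
    return clean, discarded, len(clean) >= needed


# Illustrative markers and log-scale expression values (hypothetical)
markers = ["Ptprc", "Cd3e"]
samples = {
    "rep1": {"Ptprc": 0.1, "Cd3e": 0.0, "Col1a1": 9.2},
    "rep2": {"Ptprc": 4.5, "Cd3e": 3.1, "Col1a1": 8.7},  # immune signal present
    "rep3": {"Ptprc": 0.2, "Cd3e": 0.1, "Col1a1": 9.0},
}
clean, discarded, enough = screen_samples(samples, markers, threshold=1.0)
print(clean, discarded, enough)
```

When the last value is false, a replacement sample would be needed for that cell type and organ, mirroring the procedure described above.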


Replication Transcriptome profiling was done in three biologically independent experiments. Chromatin accessibility mapping by ATAC-seq was done in 
two biologically independent experiments. Epigenome mapping by ChIPmentation was done in two biologically independent experiments, 
with two exceptions: for endothelium from lymph node and fibroblasts from thymus, only one high-quality sample could be obtained. 


Randomization Processing of samples from the various organs and cell types did not follow any particular order. The dataset was monitored for evidence of 
batch effects, but these analyses did not identify any technical covariates that should be controlled for using statistical methods. 


Blinding Because there were no pre-defined sample groups during data collection, blinding was not applicable for this study. 
Materials & experimental systems Methods 

n/a | Involved in the study n/a | Involved in the study 
Eukaryotic cell lines Flow cytometry 
Palaeontology MRI-based neuroimaging 


Animals and other organisms 


Human research participants 


Clinical data 


Antibodies 


Antibodies used CD16/CD32 blocking antibody (clone 93, BioLegend Cat# 101320, Lot# B264827, dilution 1:200) 
CD31 (clone MEC13.3, AF488, BioLegend Cat# 102514, Lot# B282351, dilution 1:100) 
CD45 (clone 30-F11, PerCP-Cy5.5, BioLegend Cat# 103132, Lot# B295198, dilution 1:300) 
CD90.2 (clone 30-H12, AF700, BioLegend Cat# 105319, Lot# B260794, dilution 1:100) 
CD106 (clone 429, AF647, BioLegend Cat# 105711, Lot# B269055, dilution 1:100) 
CD140a (clone APA5, BV605, BD Bioscience Cat# 704380, Lot# 9284670, dilution 1:100) 
CD144 (clone BV13, BV421, BioLegend Cat# 138013, Lot# B292996, dilution 1:100) 
CD324 (clone DECMA-1, APC-Cy7, BioLegend Cat# 147314, Lot# B274948, dilution 1:150) 
Epcam (CD326) (clone G8.8, PE-Cy7, BioLegend Cat# 118216, Lot# B246071, dilution 1:150) 
MAdCAM1 (clone MECA-367, BV421, BD Bioscience Cat# 742812, Lot# 9284682, dilution 1:100) 
Lymphotoxin beta receptor (clone 5G11, APC, BioLegend Cat# 134408, Lot# B283402, dilution 1:100) 
LYVE1 (clone ALY7, eFluor 660, ThermoFisher Cat# 50-0443-82, Lot# 2107832, dilution 1:100) 
Podoplanin (gp38) (clone 8.1.1, PE, BioLegend Cat# 127408, Lot# B296330, dilution 1:150) 
Ter119 (clone TER119, PerCP-Cy5.5, BioLegend Cat# 116228, Lot# B291958, dilution 1:200) 

Dead cells were stained by adding Zombie Red Fixable Viability Dye (BioLegend Cat# 423110, Lot# B225035, dilution 1:1000) or 
Zombie Aqua Fixable Viability Dye (BioLegend Cat# 423102, Lot# B274475, dilution 1:1000). 


CD16/CD32 blocking antibody: Quality tested by flow cytometric analysis according to vendor's website. 

CD31 (clone MEC13.3): Quality tested by flow cytometric analysis of antibody surface-stained cells, flow cytometry plot (staining 
of mouse splenocytes) shown on vendor's website. 

CD45 (clone 30-F11): Quality tested by flow cytometric analysis of antibody surface-stained cells, flow cytometry plot (staining of 
mouse splenocytes) shown on vendor's website. 

CD90.2 (clone 30-H12): Quality tested by flow cytometric analysis of antibody surface-stained cells, flow cytometry plot (staining 
of mouse thymocytes) shown on vendor's website. 

CD106 (clone 429): Quality tested by flow cytometric analysis of antibody surface-stained cells, flow cytometry plot (staining of 
mouse bone marrow myeloid cells) shown on vendor's website. 

CD140a (clone APA5): Quality tested by flow cytometric analysis of antibody surface-stained cells according to vendor's website. 
CD144 (clone BV13): Quality tested by flow cytometric analysis of antibody surface-stained cells, flow cytometry plot (staining of 
mouse endothelial cells bEnd.3) shown on vendor's website. 

CD324 (clone DECMA-1): Quality tested by flow cytometric analysis of antibody surface-stained cells, flow cytometry plot 
(staining of MDCK epithelial cell line) shown on vendor's website. 

Epcam (clone G8.8): Quality tested by flow cytometric analysis of antibody surface-stained cells, flow cytometry plot (staining of 
mouse thymic epithelial stromal cell line TE-71) shown on vendor's website. 

MAdCAM1 (clone MECA-367): Quality tested by flow cytometric analysis of antibody surface-stained cells according to vendor's 
website. 

Lymphotoxin beta receptor (clone 5G11): Quality tested by flow cytometric analysis of antibody surface-stained cells, flow 
cytometry plot (staining of mouse bone marrow cells) shown on vendor's website. 

LYVE1 (clone ALY7): Quality tested by flow cytometric analysis of mouse tissue, flow cytometry plots shown on vendor's website. 
Podoplanin (clone 8.1.1): Quality tested by flow cytometric analysis of antibody surface-stained cells, flow cytometry plot 
(staining of mouse thymic epithelial stromal cell line TE-71) shown on vendor's website. 

Ter119 (clone TER119): Quality tested by flow cytometric analysis of antibody surface-stained cells, flow cytometry plot (staining 
of mouse bone marrow cells) shown on vendor's website. 

Zombie Red Fixable Viability Dye: Quality tested by flow cytometric analysis, flow cytometry (staining of one day old mouse 
splenocytes) shown on vendor's website. 

Zombie Aqua Fixable Viability Dye: Quality tested by flow cytometric analysis, flow cytometry (staining of one day old mouse 
splenocytes) shown on vendor's website. 
Animals and other organisms 


Policy information about studies involving animals; ARRIVE guidelines recommended for reporting animal research 


Laboratory animals C57BL/6J mice were bred and maintained under specific pathogen free conditions at the Institute of Molecular Biotechnology 
(IMBA) of the Austrian Academy of Sciences in Vienna (Austria). Experiments were performed under specific pathogen free 
conditions at the Anna Spiegel Research Building, Medical University of Vienna (Austria). Age-matched male mice (8 to 13 weeks 
old) were used for the characterization under homeostatic conditions, for the viral infection model, and for in vivo cytokine 
treatments. 
Wild animals The study did not involve wild animals. 
Field-collected samples The study did not involve samples collected in the field. 
Ethics oversight All mouse experiments were done in compliance with the respective animal experiment licenses 
(BMWFW-66.009/0199-WF/V/3v/2015 and BMWFW-66.009/0361-WF/V/3b/2017) approved by the institutional ethical committees and the institutional 
guidelines at the Department for Biomedical Research of the Medical University of Vienna. 


Note that full information on the approval of the study protocol must also be provided in the manuscript. 


ChIP-seq 


Data deposition 


Confirm that both raw and final processed data have been deposited in a public database such as GEO. 


Confirm that you have deposited or provided access to graph files (e.g. BED files) for the called peaks. 


Data access links https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE134663 
May remain private before publication. 


Files in database submission Raw read counts; normalized log2-transformed counts-per-million 
Genome browser session (e.g. UCSC) http://genome.ucsc.edu/cgi-bin/hgTracks?db=mm10&hubClear=http://medical-epigenomics.org/papers/krausgruber2019/hub.txt&hgt.labelWidth=30&pix=1300&textSize=12&position=chr16:91,542,434-91,570,261 
Methodology 
Replicates Two biological replicates 
Sequencing depth 12 samples per HiSeq 3000/4000 lane were sequenced, with ~360 million reads per lane, thus aiming for 30 million reads 
per sample. 16 million to 50 million reads per sample were obtained. Sequencing was done with 50 bp single-end reads. 
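
The target depth follows directly from the lane yield and the multiplexing level stated above. A small sketch of that arithmetic, using only the numbers given in the text (the function name is illustrative):

```python
def target_reads_per_sample(reads_per_lane, samples_per_lane):
    """Expected reads per sample if a lane is shared equally between samples."""
    if samples_per_lane <= 0:
        raise ValueError("samples_per_lane must be positive")
    return reads_per_lane / samples_per_lane


target = target_reads_per_sample(360e6, 12)  # ~360 million reads, 12 samples per lane
obtained_low, obtained_high = 16e6, 50e6     # range reported in the text

print(f"target: {target / 1e6:.0f} million reads per sample")
print(f"obtained: {obtained_low / target:.0%} to {obtained_high / target:.0%} of target")
```

The spread between the lowest and highest sample reflects uneven pooling, which is common when libraries of differing quality share a lane.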


Antibodies H3K4me2 (clone AW30, Merck Cat#04-790) 


Peak calling parameters Peaks were called using MACS2 (version 2.7.6) function callpeak, comparing samples to input controls. The input controls 
were obtained by pooling sort-purified cells (endothelium, epithelium, fibroblast) in equal amounts for each organ. 


Data quality Data quality including peak height and background noise was assessed by visual inspection using the UCSC Genome Browser. 


Software Peak lists were aggregated to a consensus peak list using the function reduce of the package GenomicRanges (version 
1.22.4) in R. Consensus peaks overlapping with blacklisted regions (downloaded from 
http://mitra.stanford.edu/kundaje/akundaje/release/blacklists/mm10-mouse/) were discarded. Quantitative measurements were obtained by counting reads 
within consensus peaks using the function summarizeOverlaps from the GenomicAlignments (version 1.6.3) package in R. 


Flow Cytometry 

Plots 


Confirm that: 


The axis labels state the marker and fluorochrome used (e.g. CD4-FITC). 
The axis scales are clearly visible. Include numbers along axes only for bottom left plot of group (a 'group' is an analysis of identical markers). 
All plots are contour plots with outliers or pseudocolor plots. 
A numerical value for number of cells or percentage (with statistics) is provided. 


Methodology 


Sample preparation 


Standardized sample collection and organ dissociation 

Different surface markers and sorting schemes were previously used to purify endothelium, epithelium, and fibroblasts in 
individual organs, while our systematic comparison required standardized cell purification to be informative. We therefore 
tested multiple surface markers across organs and optimized a sorting scheme that produced consistent results across all 12 
investigated organs, while excluding cell types that were detectable only in one or a few tissues (most notably lymphatic 
endothelial cells). The experimental workflow followed the recommendations of the Immunological Genome Project regarding 
sample collection schedule, antibody staining, and sample pooling. At least three same-sex littermate mice were pooled for each 
biological replicate for homeostatic conditions and for LCMV infection, respectively. For the in vivo cytokine treatments, we 
used individual mice as biological replicates to reduce the total number of mice. Standardized organ harvesting and dissociation 
protocols were established to obtain single-cell suspensions for subsequent cell purification by fluorescence-activated cell sorting 
(FACS). The same digestion solution was used for all organs, in order to avoid organ-specific technical confounders. 

Brain. Following decapitation, the skull was cut longitudinally with scissors, and the cranium was opened with tweezers. Both 
brain hemispheres were carefully collected and placed into cell culture dishes containing cold PBS supplemented with 0.1% BSA 
(PBS + BSA). White matter was manually removed, the tissue shredded with scissors and added to a 50 ml tube containing 15 ml 
cold Accumax (Sigma-Aldrich) digestion solution and incubated for 45 min at 37 °C while shaking at 200 rpm. Remaining tissue 
fragments were processed with a Dounce homogenizer (Sigma-Aldrich) followed by centrifugation at 300 g for 5 min at 4 °C. 
Myelin was removed using density gradient centrifugation. Cells were recovered at the interface between an 80% Percoll layer 
and a 30% Percoll layer and washed in PBS + BSA to remove excess Percoll. 

Caecum, large intestine, small intestine. Luminal contents were removed. The organs were cut longitudinally with scissors, rinsed 
several times in PBS + BSA to remove mucus, then cut into 0.5 cm pieces and placed in 50 ml tubes containing 20 ml pre-warmed 
(37 °C) RPMI containing 10% FCS and 5 mM EDTA (RPMI + FCS + EDTA). Samples were incubated for 25 min at 37 °C in a shaking 
incubator at 200 rpm. Supernatant was collected, samples were resuspended once again in 20 ml pre-warmed RPMI + FCS + 
EDTA, and incubated for 25 min at 37 °C in a shaking incubator at 200 rpm. These wash steps were performed to dissociate 
epithelial cells. After the second incubation, supernatants were collected and combined, followed by digestion of the samples in 
15 ml cold Accumax for 45 min at 37 °C while shaking at 200 rpm. Remaining tissue fragments were processed with a Dounce 
homogenizer. Organ homogenates were combined with the epithelial cell fractions, filtered twice through a 100 µm cell strainer 
and washed in cold PBS + BSA. 

Kidney, lung, spleen, thymus. Organs were rinsed with cold PBS + BSA and shredded with scissors. 
Tissue fragments were placed into 50 ml tubes containing 20 ml cold Accumax digestion solution and incubated for 45 min at 
37 °C while shaking at 200 rpm. Remaining tissue fragments were processed with a Dounce homogenizer and filtered through a 
100 µm cell strainer. After centrifugation, 2 ml cold ACK lysis buffer (Thermo Fisher Scientific) was added for 3 min to lyse red blood 
cells and the reaction stopped by adding 20 ml of cold PBS + BSA. Supernatants were filtered twice through a 100 µm cell strainer 
and washed once to remove residual ACK lysis buffer. 

Liver. Three lobes were removed, rinsed with cold PBS + BSA, and shredded with scissors. Tissue fragments were placed into a 50 
ml tube containing 20 ml cold Accumax digestion solution and incubated for 45 min at 37 °C while shaking at 200 rpm. 
Remaining tissue fragments were processed with a Dounce homogenizer and filtered through a 100 µm cell strainer. 
Hepatocytes were removed using density gradient centrifugation. Cells recovered at the interface between an 80% Percoll layer 
and a 30% Percoll layer were washed in PBS + BSA to remove excess Percoll. 

Lymph nodes. Cervical, axillary, and inguinal lymph nodes were combined, carefully pinched with tweezers, and rinsed several 
times with cold PBS + BSA to release hematopoietic cells. Tissue fragments were placed into a 50 ml tube containing 10 ml cold 
Accumax digestion solution and incubated for 45 min at 37 °C while shaking at 200 rpm. Remaining tissue fragments were 
processed with a Dounce homogenizer and filtered twice through a 100 µm cell strainer. 

Skin. Ears were harvested at the base, and the subcutaneous fat layer scraped off with a scalpel. Tissue fragments were then 
shredded with scissors, placed into 50 ml tubes containing 15 ml of Accumax digestion solution, and incubated for 45 min at 
37 °C while shaking at 200 rpm. Remaining tissue fragments were processed with a Dounce homogenizer and filtered twice through 
a 100 µm cell strainer. 
Organ-specific sample collection and organ dissociation 

Large intestine. The organ was removed and processed as described. Briefly, luminal contents were removed, the large intestine 
cut longitudinally with scissors, rinsed several times in PBS + BSA to remove mucus, then cut into 0.5 cm pieces and placed in 50 
ml tubes containing 20 ml pre-warmed (37 °C) RPMI containing 10% FCS and 5 mM EDTA (RPMI + FCS + EDTA). Samples were 
incubated for 40 min at 37 °C in a shaking incubator at 200 rpm. Supernatant was collected, samples resuspended once 
again in 20 ml pre-warmed RPMI + FCS + EDTA, and incubated for 20 min at 37 °C in a shaking incubator at 200 rpm. These wash 
steps were performed to dissociate epithelial cells. After the second incubation, supernatants were collected and combined, 
followed by incubating samples in RPMI containing 10% FCS and 15 mM HEPES (RPMI + FCS + HEPES) for 10 min at room 
temperature. Supernatant was discarded and samples digested in RPMI + FCS + HEPES containing 100 U/ml collagenase from 
Clostridium histolyticum (Sigma) for 1 hour at 37 °C in a shaking incubator at 200 rpm. Remaining tissue fragments were 
processed with a Dounce homogenizer. Organ homogenates were combined with the epithelial cell fractions, filtered twice 
through 100 µm cell strainers and washed in cold PBS + BSA. 

Lung. The organ was removed and processed as described. Briefly, the organ was cut into 0.5 cm pieces and placed in 
gentleMACS C tubes (Miltenyi) containing 160 U/mL collagenase type 1 (Gibco) and 12 U/mL DNAse 1 (Sigma) in RPMI containing 
5% FCS, and dissociated using a gentleMACS Dissociator (Miltenyi; program m_lung_01). After incubation at 37 °C for 35 min in a 
shaking incubator at 170 rpm, digested samples were homogenized using the gentleMACS Dissociator (program m_lung_02). 
Subsequently, cell suspensions were filtered through 70 µm cell strainers and centrifuged for 5 min at 4 °C at 300g. After 
centrifugation, 1 ml cold ACK lysis buffer (Thermo Fisher) was added for 5 min to lyse red blood cells, and the reaction was stopped by 
adding 20 ml of cold PBS + BSA. Supernatants were filtered twice through a 100 µm cell strainer and washed once to remove 
residual ACK lysis buffer. 

Liver. Mice were anesthetized (ketamine/xylazine 1:3, 0.1 ml/10 g mouse, Vetoquinol) before cannulation of the liver and 
dissociation using a two-step perfusion protocol. Briefly, the liver was perfused first with 20 mL HBSS (Gibco) containing 0.5 mM 
EGTA (Sigma) and afterward with 20 mL L15 medium (Gibco) containing 40 mg/L liberase (Roche) at a rate of 5 mL/min. Next, 
the liver was removed, placed in a Petri dish with 10 ml of the same liberase-containing medium, followed by removal of the liver 
capsule. Hepatocytes were removed from the resulting cell suspension using density gradient centrifugation. Cells recovered at 
the interface between an 80% Percoll layer and a 30% Percoll layer were washed in PBS + BSA to remove excess Percoll. 
Flow cytometry and FACS 
Single-cell suspensions were washed once with PBS containing 0.1% BSA and 5 mM EDTA (PBS + BSA + EDTA). Cells were then 
incubated with anti-CD16/CD32 (clone 93, BioLegend) to prevent nonspecific binding. Single-cell suspensions were then stained 
with different combinations of antibodies against CD45 (PerCP-Cy5.5, clone 30-F11), TER-119 (PerCP-Cy5.5, clone TER-119), 
podoplanin (PE, clone 8.1.1), EpCAM (PE-Cy7, clone 8.8), lymphotoxin beta receptor (APC, clone 5G11), CD31 (FITC, clone 
MEC13.3), CD90.2 (AF700, clone 30-H12), CD106 (AF647, clone 429), CD144 (BV421, clone BV13), CD324 (APC-Cy7, clone 
DECMA-1) (all from BioLegend), CD140a (BV605, clone APAS, BD Bioscience), MAdCAM1 (BV421, clone MECA-367, BD 
Bioscience), and LYVE1 (eFluor 660, clone ALY7, Thermo Fisher Scientific) for 30 min at 4 °C, followed by two washes with PBS + 
BSA + EDTA. Dead cells were stained either by adding Zombie Red Fixable Viability Dye or Zombie Aqua Fixable Viability Dye 
(both from BioLegend) immediately prior to flow cytometry characterization or cell sorting. For flow cytometry, cells were 
acquired with an LSRFortessa (BD Biosciences) cell analyzer using the outlined gating strategies (Extended Data Fig. 1a, 2a, 8a). 
For FACS, cells were sort-purified with a MoFlo Astrios (Beckman Coulter) or SH800 (Sony) using the outlined gating strategies 
(Extended Data Fig. 1a, 8a). Data analysis was performed using the FlowJo software (Version 10.5.3, Tree Star). 


Instrument For flow cytometry, cells were acquired with an LSRFortessa (BD Biosciences) cell analyzer. For FACS, cells were sort-purified with 
a MoFlo Astrios (Beckman Coulter) or SH800 (Sony). 


Software Data analysis was performed with FlowJo (Version 10.5.3 Tree Star) software. 


Cell population abundance As expected, structural cells varied in abundance across organs (Figure 1b, Extended Data Fig. 1, and Extended Data Fig. 8). It 
was not feasible to routinely perform post-purity checks on the sorted cell populations due to their low and varied abundance. 


Gating strategy The gating strategy is depicted in Extended Data Fig. 1a, Extended Data Fig. 2a, and Extended Data Fig. 8a. 


Tick this box to confirm that a figure exemplifying the gating strategy is provided in the Supplementary Information. 
Mammalian cells reorganize their proteomes in response to nutrient stress through 
translational suppression and degradative mechanisms using the proteasome and 
autophagy systems. Ribosomes are central targets of this response, as they are 
responsible for translation and subject to lysosomal turnover during nutrient 
stress. The abundance of ribosomal (r)-proteins (around 6% of the proteome; 10^7 
copies per cell) and their high arginine and lysine content has led to the hypothesis 
that they are selectively used as a source of basic amino acids during nutrient stress 
through autophagy. However, the relative contributions of translational and 
degradative mechanisms to the control of r-protein abundance during acute stress 
responses are poorly understood, as is the extent to which r-proteins are used to 
generate amino acids when specific building blocks are limited. Here, we integrate 
quantitative global translatome and degradome proteomics with genetically 
encoded Ribo-Keima and Ribo-Halo reporters to interrogate r-protein homeostasis 
with and without active autophagy. In conditions of acute nutrient stress, cells 
strongly suppress the translation of r-proteins, but, notably, r-protein degradation 
occurs largely through non-autophagic pathways. Simultaneously, the decrease in 
r-protein abundance is compensated for by a reduced dilution of pre-existing 
ribosomes and a reduction in cell volume, thereby maintaining the density of 
ribosomes within single cells. Withdrawal of basic or hydrophobic amino acids 
induces translational repression without differential induction of ribophagy, 
indicating that ribophagy is not used to selectively produce basic amino acids during 
acute nutrient stress. We present a quantitative framework that describes the 
contributions of biosynthetic and degradative mechanisms to r-protein abundance 
and proteome remodelling in conditions of nutrient stress. 


r-Protein stoichiometry is controlled by translation and assembly 
mechanisms as well as by proteasomal degradation of excess r-proteins, 
whereas autophagy may facilitate the turnover of ribosomes en masse. 
Previous studies examining the effect of amino acid withdrawal or 
mTOR inhibition on r-protein homeostasis in mammalian cells have 
primarily focused on autophagic r-protein turnover, using either 
immunoblotting to measure r-protein abundance for specific subunits or 
Ribo-Keima to measure autophagic flux. However, a global view of 
how cells regulate net ribosome abundance upon nutrient stress is 
lacking, as autophagy is only one of several components of the 
ribosome homeostasis system. To uncover the mechanisms that control 
r-protein levels during nutrient stress, we developed a quantitative 
framework for analysing r-protein abundance, synthesis, turnover 
and subcellular partitioning, using methods applicable to ensemble 
or single-cell measurements (Extended Data Fig. 1a). 

We first used quantitative proteomics to examine the net balance 
of r-proteins after the acute withdrawal of amino acids or inhibition 
of mTOR with a small-molecule inhibitor, Torin1 (Tor1) (Fig. 1a-e, 
Extended Data Fig. 1b-d). HEK293 and HCT116 cells with or without the 
ATG8-conjugation (ATG5) or signalling (RB1CC1, also known as FIP200) 
arms of the autophagy system were subjected to amino acid withdrawal 
or treatment with Tor1 for 10 or 24 h, followed by 11-plex tandem mass 
tagging (TMT)-based proteomics (Fig. 1b). We also mined our published 
dataset using HEK293T (293T) wild-type, ATG7−/− and RB1CC1−/− cells that 
were subjected to the same experimental pipeline (Fig. 1c, Extended 
Data Fig. 1b). As expected, both treatments resulted in a reduction in the 
levels of autophagy cargo receptors (GABARAPL2, LC3B, SQSTM1 and 
TEX264) and endoplasmic reticulum (ER) proteins, which was dependent 
on ATG5 or ATG7 and RB1CC1 (Extended Data Fig. 1b-e, Supplementary 
Table 1). By contrast, plots of the log2-transformed ratio (cells treated with 
nutrient stressors/untreated cells) versus the −log10-transformed P value 
for at least 70 out of 80 r-proteins revealed only a modest (4.6-11.6%) 
reduction of r-protein levels with both forms of nutrient stress; this was 
largely independent of autophagy (Fig. 1c-f, Extended Data Fig. 1b-d), 
although there was some variation in the extent of decrease between 
individual r-proteins (Fig. 1g, Extended Data Fig. 1f). The reduction in 
abundance for several r-proteins was not observable using quantitative 
immunoblotting (Extended Data Fig. 1g-k). 
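
To give a sense of scale for these volcano plots, the short sketch below converts the reported 4.6-11.6% reductions into log2(treated/untreated) ratios. The function name is illustrative and not part of the published analysis; the only inputs taken from the text are the two percentages.

```python
import math


def log2_ratio_from_percent_reduction(percent_reduction: float) -> float:
    """Convert a percentage decrease in abundance into log2(treated/untreated)."""
    remaining_fraction = 1.0 - percent_reduction / 100.0
    return math.log2(remaining_fraction)


# Range of r-protein reductions quoted in the text.
for percent in (4.6, 11.6):
    ratio = log2_ratio_from_percent_reduction(percent)
    print(f"{percent:.1f}% reduction -> log2 ratio {ratio:+.3f}")
```

Both values lie within about 0.2 log2 units of zero, which helps explain why changes of this size are difficult to resolve by immunoblotting.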


Department of Cell Biology, Blavatnik Institute, Harvard Medical School, Boston, MA, USA. Present address: Department of Biochemistry, University of Würzburg, Würzburg, Germany. These 
authors contributed equally: Heeseon An, Alban Ordureau. e-mail: wade_harper@hms.harvard.edu 


Nature | Vol 583 | 9 July 2020 | 303 


[Fig. 1 graphic not reproduced. Recoverable labels: a, r-protein synthesis, assembly, dilution by cell division, net balance of ribosomes; b, 293T, HEK293 or HCT116 cells, UT (x3), -AA (x4), Tor1 (x4), quantification of total proteome and of >70 r-proteins; c-e, 293T, HEK293 and HCT116 -AA (10 h), x axis log2(-AA/UT); g, AA starvation (10 h), Pearson r = 0.87, x axis 293T WT (-AA/UT ratio); Tor1 treatment (10 h), Pearson r = 0.77, x axis 293T WT (Tor1/UT ratio).] 


Fig. 1 | Reduction in r-protein abundance during nutrient stress is largely 
autophagy-independent. a, Factors affecting the net balance of ribosomes 
during nutrient stress. b, The 11-plex TMT pipeline to measure ribosome 
abundance during nutrient stress with or without active autophagy. 
Normalized total cell extracts were processed for eight 11-plex TMT mass 
spectrometry (TMT-MS) experiments. -AA, withdrawal of amino acids; UT, 
untreated; WT, wild type. c-e, Volcano plots of the −log10-transformed P value 
versus the log2-transformed ratio of -AA cells/UT cells for 293T cells (WT, 
n = 8,029 proteins; ATG7−/−, n = 8,373 proteins; RB1CC1−/−, n = 8,332 proteins) 
(c), HEK293 cells (WT, n = 7,531 proteins; ATG5−/−, n = 7,504 proteins) (d) or 
HCT116 cells (WT, n = 3,779 proteins; ATG5−/−, n = 3,761 proteins; RB1CC1−/−, 
n = 3,671 proteins) (e). In c-e, n = 3 (untreated) or n = 4 (-AA) biologically 
independent samples. P values were calculated by two-sided Welch's t-test 
(adjusted for multiple comparisons); for parameters, individual P values and 
q values, see Supplementary Table 1. f, Mean ratio value (± s.d.) measured for at 
least 70 r-proteins treated as in b (numbers of proteins are shown in 
parentheses; n = 3 (untreated) or n = 4 (-AA and Tor1) biologically independent 
samples). The 293T data are from a previous study. g, Relative abundance of 
individual r-proteins in 293T cells after either 10 h of amino acid withdrawal 
(left) or 10 h of treatment with Tor1 (right). For every condition, 48 r-proteins 
with an error range of less than ±10% are plotted. Data are mean ± s.d. for n = 3 
(untreated) or n = 4 (-AA and Tor1) biologically independent samples. See also 
Extended Data Fig. 1, Supplementary Table 1. 
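
The caption names a two-sided Welch's t-test on unequal group sizes (n = 3 untreated, n = 4 treated). As a minimal sketch of what that involves for one protein, the code below computes the Welch t statistic and the Welch-Satterthwaite degrees of freedom. The intensity values are hypothetical, the function name is illustrative, and the P value and the multiple-comparison adjustment used in the paper are deliberately left out.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each group mean (sample variance uses n - 1).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t_stat, dof


# Hypothetical log2 TMT intensities for one r-protein.
starved = [9.80, 9.85, 9.78, 9.83]   # -AA, n = 4
untreated = [9.95, 10.02, 9.98]      # UT, n = 3

t_stat, dof = welch_t(starved, untreated)
print(f"t = {t_stat:.2f}, degrees of freedom = {dof:.2f}")
```

Unlike Student's t-test, this form does not assume equal variances in the two groups, which suits small replicate numbers of different size.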
Ribo-Halo analysis of r-proteins 


The absence of an obvious autophagy-dependent loss of r-proteins 
(on the basis of proteomics with normalization for the total protein 
input) created a paradox, as ribophagic flux has been observed with 
similar treatments using Ribo-Keima reporters. Changes in r-protein 
abundance could be a result of multiple mechanisms, including 
translational inhibition, degradation or effects on ribosome dilution owing to 
reduced cell division (Fig. 2a). To simultaneously examine the synthesis 
of new r-proteins, the fate of pre-existing r-proteins and the dilution 
of ribosomes through cell division, we fused a Halo cassette with the 
surface-exposed C termini of endogenous RPS3 and RPL29 genes in 
HCT116, 293T or HEK293 cells (hereafter called Ribo-Halo) (Extended 
Data Fig. 2a, b). Halo-fused proteins can be covalently labelled in a 
temporally controlled manner using distinct fluorescent ligands, which 
facilitates the analysis of single cells by flow cytometry or imaging 
and ensembles of cells using in-gel fluorescence. Halo tagging did not 
affect the abundance or translation rates of r-proteins, or the response 
to mTOR signalling (Extended Data Fig. 2c, d). We also optimized 
concentrations, wash conditions and compensation protocols to ensure 
complete Halo labelling with individual red (tetramethylrhodamine; 
TMR) or green (R110) fluorescent ligands (Extended Data Fig. 2e-i; 
see Methods). 

We first measured the effect of mTOR inhibition (14 h) on the 
total abundance of RPS3-Halo and RPL29-Halo in single HCT116 
cells by flow cytometry using a 1-h pulse-chase-labelling 
experiment with a cell-permeable red Halo ligand (Fig. 2b). The mean 
fluorescence per cell was reduced by around 25% in both cell lines 
after Tor1 treatment (Fig. 2c, d). In parallel, we found that treatment 
with Tor1 led to a reduction of around 10% in cell diameter (about 
a 25% reduction in volume), as well as a reduction in the rates of 
cell division, thereby maintaining the density of RPS3 and RPL29 
molecules in single cells (Extended Data Fig. 2j, k). Thus, changes 
in the number and size of cells probably underlie the discrepancy 
between the effects of nutrient stress measured by Ribo-Halo and 
those measured by total proteomics, as the latter method uses a 
normalized total protein input. 
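
The link between the diameter change, the volume change and the preserved ribosome density can be checked with a few lines of arithmetic. The sketch below assumes a spherical cell, so that volume scales with the cube of the diameter; the two input ratios are the approximate values quoted above, and the variable names are illustrative.

```python
def relative_volume(diameter_ratio: float) -> float:
    """Volume ratio of a sphere whose diameter is scaled by diameter_ratio."""
    return diameter_ratio ** 3


diameter_ratio = 0.90       # around 10% smaller diameter after Tor1
fluorescence_ratio = 0.75   # around 25% lower Ribo-Halo signal per cell

volume_ratio = relative_volume(diameter_ratio)
# Density is signal per unit volume, so its ratio is the quotient of the two.
density_ratio = fluorescence_ratio / volume_ratio

print(f"relative cell volume:      {volume_ratio:.3f}")
print(f"relative r-protein density: {density_ratio:.2f}")
```

Under this assumption the volume falls to roughly 0.73 of its starting value, so a 25% loss of signal per cell leaves the density essentially unchanged.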

To simultaneously measure the dilution of pre-existing r-proteins 
and the synthesis of new r-proteins, HCT116 Ribo-Halo cells were 
treated for 1 h with red Halo ligand to irreversibly label pre-existing 
r-proteins, free ligand was rapidly washed out and cells were then 
incubated in rich medium with green Halo ligand for 8, 16 or 24 h to 
label newly translated r-proteins (Fig. 2e, Extended Data Fig. 2l). As 
expected, the relative contribution of pre-existing RPS3 and RPL29 to 
the entire pool of these proteins in single cells decreased with time, 
reflecting dilution by cell division (Fig. 2e, Extended Data Fig. 2l). Fifty 
percent of pre-existing r-proteins were observed at around 16 h, in 
accordance with independently measured rates of cell proliferation 
(Extended Data Fig. 2k). The total abundance of r-proteins per cell was 
maintained by the synthesis of new r-proteins (labelled with green Halo 
ligand), consistent with a doubling in ribosome number during one cell 
cycle. After inhibition of mTOR during labelling with green Halo ligand 
(Fig. 2f, Extended Data Fig. 2m), we observed a reduced dilution of 
pre-existing r-proteins at 8, 16 and 24 h in parallel with reduced cell 
division, and a reduction in the production of newly synthesized (green) 
r-proteins, apparently reflecting translational inhibition. A 
reduction in the total number of r-proteins (red and green) by 10-30% was 
consistent with single-ligand labelling. In addition, a reduction in the 
synthesis of new r-proteins and the dilution of pre-existing r-proteins 
was observed using imaging or in-gel fluorescence in Ribo-Halo cells 
after treatment with Tor1 (14 h) (Fig. 2g, h, Extended Data Fig. 2n-q). 
Consistent with residual mTOR activity in HCT116 cells relative to 
293T cells, we observed differential repression of new RPS3-Halo by 
pulse-chase Halo labelling (40% versus 60%) (Extended Data Fig. 2o, r). 
Notably, the abundance, dilution and new synthesis of r-proteins, and 
Fig. 2 | Density, synthesis and dilution of r-proteins in single cells using 
Ribo-Halo. a, Contribution of dilution by cell division to the net ribosome 
balance during nutrient stress. b, Measurement of r-protein concentration in 
single cells using one-colour Halo labelling. c, Normalized TMR signal in RPS3-Halo 
cells (>3 x 10°) and RPL29-Halo cells (>4 x 10°) incubated with or without 
200 nM Tor1 (14 h), followed by 1 h TMR ligand treatment and flow cytometry. 
Representative of three independent experiments. d, Mean ± s.d. of triplicate 
data from c. e, Pulse-chase Ribo-Halo in HCT116 cells. Pre-existing RPS3-Halo 
was labelled with red TMR ligand (1 h) and newly synthesized RPS3-Halo was 
labelled with green R110 ligand for 8, 16 or 24 h (top) before flow cytometry 
(frequency histogram, bottom left; around 1.4 x 10‘ cells analysed). Bottom 
right, mean values (± s.d.) from the triplicate experiments. See Methods for 
details. f, Top, scheme for two-colour Ribo-Halo r-protein biogenesis and 


the changes in cell size after inhibition of mTOR (14 or 24 h), were 
unaltered in cells lacking ATG7 or RB1CC1 (Extended Data Fig. 3a-n). Thus, 
the overall abundance of r-proteins in response to mTOR inhibition 
reflects a reduction in the synthesis of new r-proteins and a reduced 
dilution of pre-existing r-proteins, the magnitudes of which may vary 
with cell type. Autophagic turnover of r-proteins and ribosomal RNA 
(rRNA) is unlikely to provide a mechanism to reverse translational 
suppression and the reduction in cell growth under conditions of 
acute stress. 
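
The pulse-chase dilution described above can be approximated with a simple model, sketched here under stated assumptions: the per-cell r-protein pool stays constant, pre-existing (red) r-proteins are lost only by being split between daughter cells, and degradation is ignored. The 16 h doubling time is inferred from the observation that about 50% of pre-existing r-proteins remain at 16 h; the 32 h value is a hypothetical slowed-division case, not a measured one.

```python
def preexisting_fraction(hours: float, doubling_time_h: float) -> float:
    """Share of the per-cell pool that is still pre-existing (red) label."""
    return 0.5 ** (hours / doubling_time_h)


chase_times_h = (8, 16, 24)

# 16 h: doubling time inferred from the text; 32 h: hypothetical slower division.
for doubling_time_h in (16.0, 32.0):
    shares = [preexisting_fraction(t, doubling_time_h) for t in chase_times_h]
    formatted = ", ".join(f"{t} h: {s:.2f}" for t, s in zip(chase_times_h, shares))
    print(f"doubling time {doubling_time_h:.0f} h -> {formatted}")
```

In this model, slower division alone raises the red share at every chase time, which matches the direction of the reduced dilution seen after mTOR inhibition.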


Translatomics during nutrient stress 


We next developed a quantitative translatome approach to 
systematically examine r-protein synthesis during nutrient stress using 
metabolic labelling with azidohomoalanine (AHA) (Fig. 3a, b). Activation 
of AHA by tRNA^Met-synthetase allows the incorporation of azides into 
proteins that can then be functionalized through click chemistry, 
thereby allowing the measurement of ongoing translation using TMR 
alkyne for in-gel quantification or biotin alkyne for purification with 
streptavidin resin and TMT-based proteomics (AHA-TMT) (Fig. 3b). 
We optimized AHA labelling (250 µM) to minimize effects on mTOR 
activity while allowing sufficient incorporation of AHA (Extended 
Data Fig. 4a-g, Methods). As expected on the basis of ribosome 
profiling and the incorporation of 35S-labelled Cys or Met, overall 


dilution labelling with or without Tor1. Pre-existing RPS3-Halo was labelled 
with TMR ligand (1 h), and newly synthesized RPS3-Halo with or without Tor1 
(200 nM) was labelled with green R110 ligand. Bottom left, ratio of the 620-nm 
emission (red) intensity to the 550-nm emission (green) intensity plotted 
against cell populations. Bottom right, mean values (± s.d.) from the triplicate 
experiments for the 8 h, 16 h and 24 h pulse chases. g, Imaging of pre-existing 
red TMR-labelled RPS3-Halo and newly synthesized green R110-ligand-labelled 
RPS3-Halo with or without Tor1 (200 nM, 14 h). Scale bars, 20 µm. h, In-gel 
fluorescence of RPS3-Halo as in f. Gels were then transferred for 
immunoblotting (IB) with α-RPS3. Experiments in g, h were repeated more 
than three times with similar results. The full immunoblot is shown in Extended 
Data Fig. 11f. See also Extended Data Figs. 2, 3. For gel source data, see 
Supplementary Fig. 1. 


translation was reduced by 60-70% after inhibition of mTOR or 
withdrawal of amino acids (3 h), as measured by TMR labelling (Fig. 3c, 
Extended Data Fig. 4h). We then quantified the translation of more 
than 8,200 proteins by AHA-TMT proteomics. The vast majority of 
proteins were translationally repressed after inhibition of mTOR or 
withdrawal of amino acids (3 h), a notable exception being eIF4EBP1 
after mTOR inhibition (Fig. 3d, e, Extended Data Fig. 4i). The 
majority of r-proteins were translationally repressed beyond the median 
value, consistent with 5′ terminal oligopyrimidine motifs in their 
mRNAs. Inhibition of mTOR repressed the translation of r-proteins 
more selectively than did the withdrawal of amino acids, despite 
similar proteome-wide levels of translational suppression (Extended 
Data Fig. 4h, j). In addition, we identified proteins for which 
translation was more selectively suppressed (group 1), largely unchanged 
(group 2) or differentially affected by the two treatments (groups 3 
and 4) (Extended Data Fig. 4k-o). Notably, the extent of translational 
suppression in ATG7−/− and RB1CC1−/− 293T cells was comparable to 
that in wild-type cells (Fig. 3e, Extended Data Fig. 4p-r). Thus, global 
translational suppression of r-proteins contributes to the control of 
ribosome abundance during acute nutrient stress; this is not 
discernibly affected by the ability of cells to recycle amino acids through 
autophagy in acute conditions; and, finally, the extent of 
translational suppression varies between individual r-proteins (Extended 
Data Fig. 4s). 
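
The statement that most r-proteins are repressed beyond the median can be expressed as a simple comparison against the proteome-wide median ratio. The sketch below shows one way to compute that share; the log2(Tor1/UT) values are hypothetical and the function name is illustrative, so this is not the published analysis.

```python
from statistics import median


def fraction_below_median(all_log2_ratios, subset_log2_ratios):
    """Share of a protein subset repressed more strongly than the proteome median."""
    proteome_median = median(all_log2_ratios)
    below = sum(1 for value in subset_log2_ratios if value < proteome_median)
    return below / len(subset_log2_ratios)


# Hypothetical log2(Tor1/UT) translation ratios for a tiny proteome.
proteome = [-1.9, -1.6, -1.4, -1.3, -1.1, -0.9, -0.4, 0.3]
# Hypothetical subset standing in for the r-proteins.
r_proteins = [-1.9, -1.6, -1.4, -1.1]

share = fraction_below_median(proteome, r_proteins)
print(f"r-proteins repressed beyond the proteome median: {share:.0%}")
```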
Fig. 3 | Global translatome and degradome analysis during nutrient stress. 
a, Contribution of r-protein synthesis to the net ribosome balance during 
nutrient stress. b, AHA-based translatomics to measure translational 
suppression during nutrient stress. The final steps are as follows: (1) western 
blot to check sample quality; (2) in-gel TMR fluorescence analysis; (3) 
biotin-streptavidin enrichment; and (4) TMT 10-plex mass spectrometry (MS) 
analysis. c, Extracts from 293T cells with or without amino acid withdrawal or 
Tor1 treatment in the presence of Met or AHA (250 µM, 3 h each) were labelled 
with TMR and analysed by in-gel fluorescence (n = 3 biologically independent 
samples for UT, -AA and Tor1; n = 1 for Met). d, Extracts from 293T cells (as in b) 
were clicked with biotin before streptavidin enrichment and proteomics. Plot 
shows the −log10-transformed P value versus the log2-transformed ratio of 
Tor1-treated cells/UT cells (3 h each) for quantification (n = 8,285 proteins; n = 3 
biologically independent samples). P values were calculated by two-sided 
Welch's t-test (adjusted for multiple comparisons); for parameters, individual 
P values and q values, see Supplementary Table 2. e, Violin plots for TMT 
intensity of newly synthesized proteins in wild-type, ATG7−/− and RB1CC1−/− 293T 
cells (n = 8,285, n = 3,336 and n = 3,418 proteins, respectively). Black horizontal 
lines show the median abundance of r-proteins (black circles); coloured 
horizontal dotted lines show the median of all proteins. Violin plots represent 
the distribution and density of the whole dataset (centre line, median; limits, 
minimum and maximum values). f, Contribution of r-protein degradation to 
the net ribosome balance during nutrient stress. Ub, ubiquitination. 
g, Schematic of degradomics analysis using AHA pulse labelling (5 h) to 
examine proteome turnover and autophagy dependence during mTOR 
inhibition (12 h). The final steps are as follows: (1) western blot to check sample 
quality; (2) click with biotin alkyne; (3) streptavidin enrichment; (4) TMT 
10-plex MS analysis. h, Pattern 1, accelerated degradation via autophagy; 
pattern 2, accelerated degradation independent of autophagy; pattern 3, 
stabilization independent of autophagy. Mean ratio values (± s.e.m.) of 10 
representative proteins (n = 3 biological triplicate experiments). i, Volcano 
plots of the −log10-transformed P value versus the log2-transformed ratio of 
Tor1-treated cells/UT cells (12 h each) for 293T wild-type, ATG7−/− and RB1CC1−/− 
cells (n = 8,304, n = 8,319 and n = 8,590 proteins, respectively; n = 3 biologically 
independent samples). P values were calculated by two-sided Welch's t-test 
(adjusted for multiple comparisons); for parameters, individual P values and 
q values, see Supplementary Table 3. j, Mean ratio values (± s.e.m.) for AHA-labelled 
r-proteins as in g (n = 58 r-proteins, s.d. < 0.3 filter for every condition; n = 3 
biologically independent samples). See also Extended Data Figs. 4-6, 
Supplementary Tables 2, 3. For gel source data, see Supplementary Fig. 1. 


Global degradomics during nutrient stress 


To directly examine the degradation of r-proteins during nutrient 
stress and the contribution of autophagy to turnover (Fig. 3f), we 
pulse-labelled wild-type, ATG7−/− and RB1CC1−/− 293T cells with AHA, 
allowed 1 h for the turnover of short-lived proteins in rich medium and 
then used TMT-based proteomics to compare the abundance of 
biotinylated AHA-labelled proteins with or without mTOR inhibition (12 h) 
(Fig. 3g, Extended Data Fig. 5a, b). Among the proteins quantified (more 
than 8,300), we found three patterns of turnover (Fig. 3h, Extended 
Data Fig. 5c, d): pattern 1, autophagy-dependent degradation, 
including several autophagy receptors; pattern 2, autophagy-independent 
turnover that is enhanced by mTOR inhibition, including several 
proteins that are known to be degraded by the proteasome upon mTOR 
inhibition; and pattern 3, autophagy-independent stabilization upon 
mTOR inhibition. Notably, r-proteins conformed to pattern 2, and were 
degraded in a largely autophagy-independent manner, although 
degradation rates varied for individual r-proteins with a correlated rank 
order with or without Tor1 (R² value of around 0.7) (Fig. 3i, j, Extended 
Data Fig. 5e, f). As expected, 326 ER proteins were degraded in an 
autophagy-dependent manner, and serve as a positive control for this 
degradomics approach to detect autophagy substrates (Extended Data 
Fig. 5g-i). To confirm these results, we performed a time-course 
experiment to examine global degradomics at 5, 10 and 15 h after inhibition 
of mTOR (Extended Data Fig. 6a-g). Consistently, mTOR inhibition 
induced a faster turnover of r-proteins in a time-dependent manner, 
which was largely independent of an active autophagy pathway. 


Proteome partitioning during amino acid stress 


Given that several ribosome assembly steps occur in the nucleus 
or nucleolus, we examined whether nutrient stress alters the 
nuclear-to-cytosolic ratio of r-proteins (Extended Data Fig. 7a). After 
optimizing isolation of nuclei and cytosol (Extended Data Fig. 7b-f), 
we quantified cytosolic and nuclear protein and phosphoprotein 
pools with or without the withdrawal of amino acids (3 h) (Extended 
Data Fig. 7g-l). We observed a decrease in the nuclear abundance of 
FOXK1, a decrease in the phosphorylation of RPS6, FOXK1, FOXK2 
and CAD, and an increase in the nuclear abundance of TFEB, MITF and 
TFE3, indicative of mTOR inhibition (Extended Data Fig. 7j-m). We 
also identified several nucleolar ribosome-biogenesis factors (PWP1, 
SDAD1 and NVL) that accumulate in the cytosol after the withdrawal 
of amino acids (Extended Data Fig. 7j, k, n-p). By contrast, the 
distribution of r-proteins was largely unchanged after amino acid withdrawal. 
As expected, around 91% of 60S r-proteins and 95% of 40S r-proteins 
are cytoplasmic in untreated cells, consistent with the longer time 
required to assemble the 60S subunit at steady state and the larger 
fraction of 40S subunits that are assembled in the cytosol (Extended 
Data Fig. 8a). After the withdrawal of amino acids, the nuclear pool 
of r-proteins was reduced by 20% (around 1.5% of the total ribosome 
pool) (Extended Data Fig. 8b-d), indicating that the redistribution 
of r-proteins has a very minor role in ribosome homeostasis in these 
conditions. 
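
The relation between a 20% loss from the nuclear pool and a shift of only about 1.5% of all r-proteins follows from how small the nuclear pool is. The sketch below reproduces that arithmetic from the cytoplasmic fractions quoted above; weighting the 60S and 40S subunits equally is a simplifying assumption made here, not a detail given in the text.

```python
# Cytoplasmic fractions in untreated cells, as quoted in the text.
cytoplasmic_fraction = {"60S": 0.91, "40S": 0.95}
nuclear_pool_loss = 0.20  # relative reduction of the nuclear pool after -AA

# Assumption: both subunits contribute equally to the average nuclear fraction.
nuclear_fractions = [1.0 - value for value in cytoplasmic_fraction.values()]
mean_nuclear_fraction = sum(nuclear_fractions) / len(nuclear_fractions)

# Share of the total r-protein pool that leaves the nucleus.
shift_of_total_pool = mean_nuclear_fraction * nuclear_pool_loss

print(f"mean nuclear fraction: {mean_nuclear_fraction:.1%}")
print(f"shift as share of total pool: {shift_of_total_pool:.1%}")
```

With these inputs the nuclear pool averages 7% of the total, so a 20% loss corresponds to about 1.4% of all r-proteins, in line with the figure of around 1.5% given above.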


Ribophagic flux during nutrient stress 


Our finding that cells that lack the capacity for autophagy have no 
obvious defects in the response to nutrient stress, on the basis of 
total-proteome, Ribo-Halo, translatome and degradome analysis, led 
us to further quantify ribophagic flux. Using Ribo-Keima reporters, 
we estimated that 3-4% of ribosomes are degraded in the lysosome 
after Tor1 treatment (10 h) (Extended Data Fig. 8e-h), consistent with 
previous studies. The magnitude of autophagic flux is such that RPS3 
turnover as measured by AHA-TMT proteomics (around 20% (s.d. < 0.1) 
in 12 h) is not obviously sensitive to the deletion of ATG7 or RB1CC1 
(Extended Data Fig. 8i). NUFIP1, a subunit of the nucleoplasmic R2TP 
pathway that is involved in 60S rRNA biogenesis, was reported to 
traffic to the cytosol in response to nutrient stress and to function as 
a selective ribophagy receptor for lysosomal delivery. The absence 
of NUFIP1 re-localization after amino acid withdrawal (as measured by 
TMT-based proteomics; Extended Data Fig. 7j, k) led us to re-evaluate 
its role in ribophagy. First, NUFIP1−/− 293T cells exhibit reduced levels 
of r-proteins in rich medium, in line with the previously defined role 
of R2TP in 60S biogenesis (Extended Data Fig. 9a, b). Second, NUFIP1 
deletion had no effect on the total Ribo-Halo abundance, synthesis or 
dilution (Extended Data Fig. 9c-e), or on r-protein abundance as 
measured by TMT-based proteomics (Extended Data Fig. 9f-i) during mTOR 
inhibition. Third, RPS3-Keima flux was unaffected in NUFIP1−/− cells 
subjected to nutrient stress (Extended Data Fig. 9j). We conclude that 
NUFIP1 is not required for ribophagic flux in these stress conditions. 
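
As a rough way to compare the two magnitudes quoted above, the sketch below converts the RPS3 turnover (around 20% in 12 h) into a half-life and into the expected loss over 10 h, the window used for the Ribo-Keima estimate. It assumes simple first-order decay, which the text does not state. Ribo-Keima flux and AHA-TMT turnover also come from different assays, so the comparison is indicative only.

```python
import math


def first_order_rate(fraction_lost: float, hours: float) -> float:
    """Rate constant (per hour) for first-order loss of fraction_lost in hours."""
    return -math.log(1.0 - fraction_lost) / hours


def fraction_lost_after(rate_per_h: float, hours: float) -> float:
    """Fraction of the starting pool lost after the given time."""
    return 1.0 - math.exp(-rate_per_h * hours)


# RPS3 turnover from AHA-TMT proteomics: around 20% in 12 h (from the text).
rate = first_order_rate(0.20, 12.0)
half_life_h = math.log(2.0) / rate
lost_in_10_h = fraction_lost_after(rate, 10.0)

ribophagic_flux_10_h = (0.03, 0.04)  # 3-4% in 10 h (from the text)

print(f"implied RPS3 half-life: {half_life_h:.1f} h")
print(f"expected total loss in 10 h: {lost_in_10_h:.1%}")
for flux in ribophagic_flux_10_h:
    print(f"ribophagy {flux:.0%} of pool -> {flux / lost_in_10_h:.0%} of that loss")
```

Under these assumptions ribophagy accounts for a minority of the r-protein lost over the same period, which is consistent with the limited effect of deleting ATG7 or RB1CC1.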


Single amino acid or purine depletion 


The enrichment (around twofold) of basic amino acids in r-proteins 
relative to the proteome, together with the fact that ribosomes are rich in 
rRNA, has led to the hypothesis that ribosomes are selectively used for 
autophagy in response to decreased levels of these building blocks. 
We used our r-protein analysis pipeline to assess the effects of selective 
withdrawal of Arg, Lys or Leu on ribosome homeostasis and ribophagic 
flux (Fig. 4a). Compared with Tor1 treatment or complete starvation of 
amino acids, selective withdrawal of Leu or Arg resulted in only a partial 
loss of mTOR activity (Extended Data Fig. 10a, b), as described 
previously. However, withdrawal of Arg resulted in stronger translational 
inhibition, as measured by puromycin incorporation assay (Extended 
Data Fig. 10b, c), two-colour RPS3-Halo assay (Fig. 4b, Extended Data 
Fig. 10d) and translatome analysis (Fig. 4c, d, Extended Data Fig. 10e-h). 
The order of translational suppression was Arg > Tor1 > Lys > Leu for 
both the total proteome and r-proteins. However, ribophagic flux 
was less than 3.4% in all cases (Extended Data Fig. 10i), indicating that 
acute loss of basic amino acids does not selectively promote ribophagy. 

We also found that 6-mercaptopurine (6-MP), an inhibitor of 
hypoxanthine-guanine phosphoribosyltransferase 1 (HPRT1) that 
blocks production of inosine monophosphate (IMP) and guanosine 
monophosphate (GMP) (Extended Data Fig. 11a), markedly suppressed
proteome and r-protein translation (Extended Data Fig. 11b-d), and 
also suppressed both dilution and new synthesis in the two-colour 
Ribo-Halo assay (Extended Data Fig. 11e, f). However, the reduction 
in r-protein abundance measured by TMT-based proteomics was 
ATG5-independent (Extended Data Fig. 11g). Consistent with this,
ribophagic flux as measured using the RPS3-Keima processing assay 
was not detectably induced during treatment with 6-MP (24 h), in stark
contrast to treatment with arsenite (Extended Data Fig. 11h), indicating
that autophagy is unlikely to be a major driver in the regulation of
r-protein abundance in response to purine depletion. 


Framework for r-protein homeostasis 


Here we have developed an experimental pipeline for broadly prob- 
ing the integration of biosynthetic and degradative mechanisms that 
control signal-dependent proteome remodelling, with a focus on 
r-proteins. We found that the major contributing factors in the con- 
trol of r-protein abundance in conditions of either mTOR inhibition or 
amino acid withdrawal were: (1) translational suppression of r-proteins; 
(2) reduced dilution of pre-existing ribosomes, reflecting lower rates 
of cell proliferation; and (3) non-autophagic degradation of r-proteins 
(Fig. 4e). By contrast, the contribution of ribophagy to overall changes 
in ribosome abundance was relatively small (around 3-4% in 10-12 h). 
The finding that r-protein turnover parallels that of proteasomal targets 
by degradomics (pattern 2, Fig. 3h) suggests that the proteasome is 
involved, but the relevant ubiquitin ligases remain unknown.


Nature | Vol 583 | 9 July 2020 | 307 




[Figure 4, panels a-e. a, Single amino acid removal (protein building block) and the net balance of ribosomes: r-protein synthesis, assembly, dilution by cell division, and degradation. b, HCT116 RPS3-Halo. c, AHA (250 µM, 3 h) labelling workflow: WB or in-gel TMR fluorescence analysis, biotin-streptavidin enrichment, and TMTpro 16-plex MS analysis. d, Relative abundance of TMT (scaled to 100) for UT, Tor1, -Arg, -Lys and -Leu (3 h). e, Net balance of ribosomes: ribosome conc. after nutrient stress = number of ribosomes / (cell volume x cell division rate); only a 5-8% decrease by total proteome analysis.]

Fig. 4 | Analysis of r-protein homeostasis in response to single amino acid
perturbations. a, Points of intersection of individual amino acids with net 
ribosome production. b, HCT116 RPS3-Halo cells were left untreated, 
subjected to the withdrawal of all amino acids or the selective withdrawal of 
Leu or Arg, or incubated with Tor1 before analysis of pre-existing and newly
synthesized RPS3-Halo as in Fig. 2f. For quantification of n = 2 biologically
independent experiments, see Extended Data Fig. 10d. c, Experimental 
workflow for AHA translatome analysis of 293T cells with or without the 
indicated amino acids or with Tor1 treatment. d, Violin plots showing the
relative proteome translation rates and the effect of removal of the indicated 
amino acids or inhibition of mTOR on the translation of individual r-proteins 
(black circles). Black horizontal lines show the median abundance of r-proteins; 
coloured horizontal dotted lines show the median of all proteins. Violin plots 
represent the distribution and density of the whole dataset (centre line, 
median; limits, minimum and maximum values). n= 3,372 quantified proteins 
(single shot). e, Ribosome homeostasis framework. Left, points of regulation. 
Blue arrows indicate the net effect of nutrient stress. Middle two panels, 
comparison of the effects of nutrient stress on translation and degradation 
(primarily through non-autophagic mechanisms) of the global proteome and 
r-proteins. Far right, effect of nutrient stress on the dilution of r-proteins
through cell division. The concentration of ribosomes (ribosome conc.) in a
population of cells after nutrient stress will reflect a decrease in the number of
r-proteins through translational and degradative mechanisms, corrected for by 
changes in cell volume and cell division. See also Extended Data Fig. 10, 
Supplementary Table 6. For gel source data, see Supplementary Fig. 1. 


We developed a quantitative framework (Fig. 4e, Extended Data 
Fig. 11i, j) that facilitates an understanding of the factors that contrib- 
ute to the cellular concentration of ribosomes, taking into account 
the number of ribosomes, timing of cell division and cell volume. 
A central element of the model is that the number of ribosomes 
per cell increases by twofold over the course of the cell cycle and 
is reset upon division. In conditions of nutrient stress, changes in
the rates of r-protein synthesis and degradation slow the increase in
r-protein abundance (numerator), which is partially compensated for
by changes in cell volume and the timing of division (denominator) 
(Extended Data Fig. 11i, j). Incorporation of our data on translational 
inhibition, degradation, cell volume and cell division rate into this 
model accurately estimates the total change in ribosome concen- 
tration measured by total proteomics. However, our present model 
uses zero-order kinetics for changes in translation, turnover and cell 
division, and more-precise models will require measurements of rates 
of change during the stress interval. Our framework highlights the 
importance of considering multiple mechanisms of r-protein homeo- 
stasis simultaneously to fully understand similarities and differences 
across cell systems that may respond to nutrient stress in distinct 
ways. In this regard, ribophagy in budding yeast has been proposed
to be a major contributor to proteome remodelling in response to
nitrogen deficiency. This apparent differential use of autophagy in
yeast compared to cultured mammalian tissue cells could reflect the
fact that r-protein abundance in yeast is an order of magnitude higher
than that seen in mammalian cells; however, it is also complicated by
the fact that the duration of nutrient stress applied (24 h) is around 16
times the cell division time in yeast. Notably, approximately 10% of
ER proteins are degraded within 10 h of the withdrawal of amino acids
in human cells, and this is blocked by genetic ablation of autophagy,
as verified here by degradomics (Extended Data Fig. 5g-i). Given that
the contribution of ER proteins to the total proteome mass rivals that
of r-proteins, the ER may constitute a major target for autophagic
degradation in response to the withdrawal of amino acids or inhibition
of mTOR. Ribosomes may also be degraded to some extent
by autophagy through association with the rough ER, as visualized
in situ by electron microscopy. It seems probable that in vivo and
with chronic nutrient stress, autophagy will be responsible for the
generation of amino acids such as arginine, as has been shown in the
context of tumorigenesis.
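
The quantitative framework described above (ribosome number as the numerator, cell volume and cell division as the denominator, all with zero-order kinetics) can be illustrated with a minimal sketch. This is a population-level simplification written for illustration, not the authors' model; the function name, its parameters and all numerical values are hypothetical.

```python
def relative_ribosome_concentration(synthesis_rate, degradation_rate,
                                    division_rate, volume_ratio, hours):
    """Ribosome concentration at the end of an interval, relative to its start.

    Assumptions (not taken from the study): all rates are zero-order, i.e.
    constant fractions of the starting value per hour, and volume_ratio is
    the mean cell volume at the end divided by that at the start.
    """
    if volume_ratio <= 0:
        raise ValueError("volume_ratio must be positive")
    # Numerator: total r-protein pool of the population.
    ribosomes = 1.0 + (synthesis_rate - degradation_rate) * hours
    # Denominator: total cell volume = cell number x mean cell volume.
    cells = 1.0 + division_rate * hours
    return ribosomes / (cells * volume_ratio)


# Hypothetical values for illustration only (not measurements).
balanced = relative_ribosome_concentration(0.05, 0.0, 0.05, 1.0, 12)
stressed = relative_ribosome_concentration(0.02, 0.01, 0.02, 0.95, 12)
print(f"balanced growth: {balanced:.3f}, nutrient stress: {stressed:.3f}")
```

In this sketch, synthesis that keeps pace with division leaves the concentration unchanged, whereas slower synthesis plus degradation lowers it only modestly because slower division and smaller cell volume shrink the denominator as well.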
Methods


Cell lines 

HEK293 (human embryonic kidney, fetus, ATCC CRL-1573, RRID: 
CVCL_0045), HCT116 (human colorectal carcinoma, male, ATCC CCL- 
247, RRID: CVCL_0291) and HEK293T (human embryonic kidney, fetus, 
ATCC CCL-3216, RRID: CVCL_0063) cells were grown in Dulbecco’s modi- 
fied Eagle’s medium (DMEM, high glucose and pyruvate) supplemented 
with 10% fetal calf serum and maintained in a 5% CO2 incubator at 37 °C.
Karyotyping (GTG-banded karyotype) of HCT116, HEK293 and 293T 
cells was performed by Brigham and Women’s Hospital Cytogenomics 
Core Laboratory. All cell lines were found to be free of mycoplasma 
using the Mycoplasma Plus PCR assay kit (Agilent). 


Generation of RPS3-HaloTag7 and RPL29-HaloTag7 knock-in 
cell lines using CRISPR-Cas9 gene editing

Guide RNAs (gRNAs) targeting the C-terminal region of human 
RPS3 and RPL29 genes were designed using the CHOPCHOP web- 
site (http://chopchop.cbu.uib.no/). The guide sequences for 
the RPS3 gene (5’-GACATACCTGTTATGCTGTG-3’) or the RPL29 gene
(5’-GAGATATCTCTGCCAACATG-3’) were assembled into a pX459
plasmid. Donor vectors were constructed by assembling a HaloTag7
transgene with upstream and downstream homology arms (650 nucleotides
each) into a digested pSMART plasmid by Gibson assembly.
HEK293, HEK293T and HCT116 cells were transfected with donor and
gRNA vectors (1:1 ratio) by Lipofectamine 3000 (Invitrogen). Five
days after the transfection, the pool of transfected cells was treated
with 100 nM Halo-TMR ligand for 1 h, followed by washing three times.
Fluorescence-positive cells were sorted into 96-well plates by flow 
cytometry (MoFlo Astrios EQ, Beckman Coulter). Three weeks later, 
the expanded single-cell colonies were screened for the integration of 
the HaloTag7 transgene by immunoblotting with α-RPS3 or α-RPL29,
followed by genotyping. 
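
As a quick sanity check on the two knock-in guide sequences quoted above, one can confirm their length and base composition. The sketch below is illustrative only; the helper name `gc_fraction` and the dictionary `GUIDES` are hypothetical, and it says nothing about how the guides were selected by CHOPCHOP.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a DNA sequence (A/C/G/T only)."""
    seq = seq.upper()
    if not seq or set(seq) - set("ACGT"):
        raise ValueError(f"not a plain DNA sequence: {seq!r}")
    return (seq.count("G") + seq.count("C")) / len(seq)


# Guide sequences as printed in the text (5' to 3').
GUIDES = {
    "RPS3": "GACATACCTGTTATGCTGTG",
    "RPL29": "GAGATATCTCTGCCAACATG",
}

for gene, guide in GUIDES.items():
    print(gene, len(guide), "nt, GC fraction", gc_fraction(guide))
```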


Generation of gene-knockout cell lines using CRISPR-Cas9 gene 
editing 

RB1CC1, ATG5, ATG7 and NUFIP1 knockout in 293T, HEK293 and
HCT116 cell lines was carried out by plasmid-based transfection of
Cas9/gRNA using a pX459 plasmid as described. First, six gRNAs for
ATG5, ATG7 and NUFIP1 and four gRNAs for RB1CC1 knockout were
designed using the CHOPCHOP website. Puromycin selection was performed
24 h after the transfection, and 48 h after the transfection,
the gene-cleavage efficiency of each gRNA was measured by Surveyor
assay. The following gRNAs were shown to have the best cutting
efficiency among the tested guides: 5’-GICCAAGGCACTACTAAAAG-3’
(exon 2) for ATG7; 5’-GATCACAAGCAACTCTGGAT-3’ (exon 5) for
ATG5; 5’-GAAGAATCTGGGCGTCGAA-3’ (exon 1) for NUFIP1; and
5’-GCTACGATTGACACTAAAGA-3’ (exon 7) for RB1CC1. Single cells
were sorted into 96-well plates using a limiting dilution method, and
expanded clonal cells were screened by immunoblotting with α-ATG7,
α-ATG5, α-RB1CC1, α-NUFIP1 and α-LC3B antibodies.


Reagents 

Antibodies. The following antibodies were used: RPS3 (Cell Signaling
Technology, 9538), RPL28 (Abcam, ab138125), RPS15a (Bethyl Lab,
A304-990A-T), ATG7 (Cell Signaling Technology, 8558S), Keima (MBL
International, M182-3), LC3B (MBL International, M186-3), ATG5 (Cell
Signaling Technology, 12994), p70 S6K phospho-T389 (Cell Signaling
Technology, 9234S), phospho-S6 ribosomal protein Ser235/236
(Cell Signaling Technology, 4858), TEX264 (Sigma, HPA017739), RPL23
(Bethyl Lab, A305-010A-T), RPL7 (Bethyl Lab, A300-741A-T), RPL29
(Proteintech Group, 15799-1-AP), Tubulin (Abcam, ab7291), SQSTM1
(Novus Biologicals, H00008878-M01), anti-puromycin antibody (EMD
Millipore, MABE343), NUFIP1 (Proteintech Group, 12515-1-AP), RPS6
(Cell Signaling Technology, 2217), 4EBP1 (Cell Signaling Technology,
9644), Lamin A/C (Cell Signaling Technology, 4777), TFEB (Cell Signaling
Technology, 4042), SDAD1 (Bethyl Lab, A304-692A-T), NVL (Proteintech
Group, 16970-1-AP), c-Maf (R&D Systems, MAB8227-SP), IRDye 800CW
Streptavidin (LI-COR, 926-32230), IRDye 800CW goat anti-rabbit IgG
H+L (LI-COR, 925-32211) and IRDye 680RD goat anti-mouse IgG H+L
(LI-COR, 926-68070).


Chemicals, peptides and recombinant proteins. The following
chemicals, peptides and recombinant proteins were used: Tor1 (Cell
Signaling Technology, 14379), SAR405 (APExBio, A8883), azidohomoalanine
(Click Chemistry Tools, 1066-1000), 5-TAMRA alkyne (Click
Chemistry Tools, 1255-1), biotin-PEG4-alkyne (Click Chemistry Tools,
TA105-25), HaloTag R110Direct Ligand (Promega, G3221), HaloTag TMR
(5 mM) (Promega, G8251), 6-MP monohydrate (Sigma, 852678-1G-A),
sodium ascorbate (VWR International, 95035-692), poly-L-lysine solution
(Sigma, P4832), FluoroBrite DMEM (Thermo Fisher Scientific,
A1896701), benzonase nuclease HC (Millipore, 71205-3), urea (Sigma,
Cat#U5378), sodium dodecyl sulfate (SDS) (Bio-Rad, Cat#1610302),
Surveyor Mutation Detection Kit (Integrated DNA Technologies, 706025),
Revert Total Protein Stain kit (LI-COR, P/N926-11010),
1,1,1,3,3,3-hexafluoro-2-propanol (Sigma, 52517), DMEM, high-glucose,
pyruvate (Gibco/Invitrogen, 11995), DMEM, low-glucose, without amino
acids (US Biological, D9800-13), TCEP (Gold Biotechnology), puromycin
(Gold Biotechnology, P-600-100), formic acid (Sigma-Aldrich, 94318),
protease inhibitor cocktail (Sigma-Aldrich, P8340), PhosSTOP (Sigma-Aldrich,
4906845001), trypsin (Promega, V511C), Lys-C (Wako Chemicals,
129-02541), Rapigest SF Surfactant (Glixx Laboratories, Cat#GLXC-07089),
EPPS (Sigma-Aldrich, Cat#E9502), 2-chloroacetamide (Sigma-Aldrich,
C0267), TMT 11plex Label Reagent (Thermo Fisher Scientific,
Cat#90406 and #A34807), TMTpro 16plex Label Reagent (Thermo Fisher
Scientific, Cat#A44520), hydroxylamine solution (Sigma, Cat#438227),
Empore SPE Disks C18 (3M, Sigma-Aldrich, Cat#66883-U), Sep-Pak C18
Cartridge (Waters, Cat#WAT054960 and #WAT054925), SOLA HRP SPE
Cartridge, 10 mg (Thermo Fisher Scientific, Cat#60109-001), High
pH Reversed-Phase Peptide Fractionation Kit (Thermo Fisher
Scientific, Cat#84868), High-Select Fe-NTA Phosphopeptide
Enrichment Kit (Thermo Fisher Scientific, Cat#A32992), Bio-Rad Protein
Assay Dye Reagent Concentrate (Bio-Rad, #5000006) and Pierce
Quantitative Colorimetric Peptide Assay (Thermo Fisher Scientific,
#23275).


Preparation of amino-acid-free medium 

DMEM powder (4.16 g, US Biological) and 1.85 g sodium bicarbonate
were dissolved in 400 ml H2O. To the solution, 1.75 g of glucose, 5 ml of
100x sodium pyruvate (final concentration 1 mM) and 50 ml of dialysed
fetal calf serum were added, and 150 µl of 6 N HCl was slowly added
to bring the pH to 7.2. The final volume was adjusted to 500 ml. The
medium was filtered through a 0.2-µm filter and kept at 4 °C.


Preparation of medium without Met, Lys or Arg 

One litre of base medium for the indicated amino acid withdrawal
was prepared by dissolving DMEM powder (8.32 g, US Biological),
3.7 g/l sodium bicarbonate, 3.5 g/l glucose, 30 mg/l glycine, 63 mg/l
cystine·2HCl, 580 mg/l glutamine, 42 mg/l histidine·HCl·H2O, 105 mg/l
isoleucine, 66 mg/l phenylalanine, 42 mg/l serine, 95 mg/l threonine,
16 mg/l tryptophan, 104 mg/l tyrosine·2Na·2H2O, 94 mg/l valine and 105 mg/l
leucine in 880 ml H2O, and the pH was titrated to pH 7.2 using 2 M HCl.
The base medium was filtered through a 0.2-µm filter and kept at 4 °C.
Before each experiment, 0.9 equivalents (v/v) of the base medium
were supplemented with 0.1 equivalents (v/v) of dialysed bovine serum albumin,
0.01 equivalents (v/v) of 100x sodium pyruvate (final concentration
1 mM) and 0.001 equivalents (v/v) of each 1000x amino acid stock (methionine
stock: 37.5 mg/ml (final 37.5 mg/l = 250 µM); lysine·HCl stock: 146 mg/ml
(final 146 mg/l); arginine·HCl stock: 84 mg/ml (final 84 mg/l)), except
for the limiting amino acid.
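
The stock dilutions quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the molecular weight of L-methionine (149.21 g/mol) is a standard reference value rather than a number given in the text.

```python
METHIONINE_MW = 149.21  # g/mol; standard value for L-methionine (assumption, not from the text)


def final_mg_per_l(stock_mg_per_ml, fold):
    """Final concentration (mg/l) after diluting a stock `fold`-fold."""
    if fold <= 0:
        raise ValueError("fold must be positive")
    return stock_mg_per_ml * 1000.0 / fold  # mg/ml -> mg/l, then dilute


def mg_per_l_to_micromolar(mg_per_l, mol_weight):
    """Convert a mass concentration (mg/l) to micromolar."""
    return mg_per_l / mol_weight * 1000.0  # mg/l divided by g/mol gives mmol/l


met_final = final_mg_per_l(37.5, 1000)  # 1000x methionine stock
print(met_final, "mg/l =", round(mg_per_l_to_micromolar(met_final, METHIONINE_MW)), "µM")
print(final_mg_per_l(146, 1000), "mg/l lysine·HCl")
print(final_mg_per_l(84, 1000), "mg/l arginine·HCl")
```

With these inputs the methionine concentration comes out at roughly 251 µM, consistent with the stated 250 µM.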


Preparation of leucine-free medium with or without methionine

The solubility of leucine in H2O was too low for it to be added at the last
step as a concentrate. Therefore, the leucine-free medium had to be
prepared separately from the −Lys and −Arg media. To prepare 1 l of
medium, DMEM powder (8.32 g, US Biological), 3.7 g/l sodium bicarbonate,
3.5 g/l glucose, 30 mg/l glycine, 63 mg/l cystine·2HCl,
580 mg/l glutamine, 42 mg/l histidine·HCl·H2O, 105 mg/l isoleucine,
66 mg/l phenylalanine, 42 mg/l serine, 95 mg/l threonine, 16 mg/l
tryptophan, 104 mg/l tyrosine·2Na·2H2O, 94 mg/l valine, 146 mg/l Lys and
84 mg/l Arg were dissolved in 880 ml H2O, and the pH was titrated to
pH 7.2 using 2 M HCl. The base medium was filtered through a 0.2-µm
filter and kept at 4 °C. Before each experiment, 0.9 equivalents (v/v) of
the base medium were supplemented with 0.1 equivalents (v/v) of dialysed bovine
serum albumin, 0.01 equivalents (v/v) of 100x sodium pyruvate and
0.001 equivalents (v/v) of 1000x methionine (final 37.5 mg/l = 250 µM)
or 0.01 equivalents (v/v) of 100x AHA (final 250 µM).


Cell lysis and immunoblotting assay 

Immunoblotting assay was performed based on a previously reported
method. In brief, cells were cultured in the presence of the corresponding
stress to around 40-50% confluency in 6-well plates, 10-cm or 15-cm
dishes. After removing the medium, the cells were washed with DPBS
three times, then in-house RIPA buffer (50 mM HEPES, 150 mM NaCl,
1% sodium deoxycholate, 1% NP-40, 0.1% SDS, 2.5 mM MgCl2, 10 mM
sodium glycerophosphate, 10 mM sodium biphosphate) containing
mammalian protease inhibitor cocktail (Sigma), PhosSTOP and
20 unit/ml benzonase (Millipore) was added directly onto the cells. Cell
lysates were collected with cell scrapers and sonicated on ice three times,
followed by centrifugation (13,000 rpm, 5 min). The protein concentration of the
supernatant was measured by Bradford assay, and the whole-cell lysate
was further denatured by the addition of LDS sample buffer supplemented
with 50 mM DTT, followed by heating at 75 °C for 5 min. Twenty or
thirty micrograms of each lysate was loaded onto a 4-20% Tris-Glycine
gel (Thermo Fisher Scientific) or a 4-12% NuPAGE Bis-Tris gel (Thermo
Fisher Scientific), followed by SDS-PAGE with Tris-Glycine SDS running
buffer (Thermo Fisher Scientific) or MES SDS running buffer (Thermo
Fisher Scientific), respectively. The proteins were electro-transferred to
PVDF membranes (0.45 µm, Millipore), and then the total protein was
stained with Revert total protein stain kit (LI-COR) or Ponceau staining
(Thermo Fisher Scientific). The membrane was then blocked with 5%
non-fat milk (room temperature, 30 min), incubated with the indicated
primary antibodies (4 °C, overnight), washed three times with TBST (total
30 min), and further incubated with fluorescent IRDye 680RD goat
anti-mouse IgG H+L, IRDye 800RD goat anti-mouse IgG H+L or IRDye
800CW goat anti-rabbit IgG H+L secondary antibody (1:15,000) for 1 h.
After a thorough wash with TBST for 30 min, the near-infrared signal was
detected using an Odyssey CLx imager and quantified using ImageStudioLite
(LI-COR). For quantitative immunoblotting of endogenously tagged
Ribo-Keima reporter cells, at least 70 µg of total lysate was loaded onto
an SDS-PAGE gel owing to the low level of processed Keima. A 4-20%
Tris-Gly gel was used to resolve the proteins.


Flow cytometry analysis for one-colour Ribo-Halo labelling

The corresponding Ribo-Halo cells were plated onto 24-well plates one 
day before the nutrient stress. The cells were left untreated or treated 
with 200 nM Tor1 for the indicated time. One hour before collecting
the cells, 100 nM of TMR ligand was added in addition to the Tor1 for 
HaloTag labelling. The cells were washed with fresh DMEM with or 
without Tor1, three times with 10-min duration each. After trypsin 
treatment, the collected cells were resuspended in 250 µl FACS buffer
(1x DPBS, 1 mM EDTA, 1% FBS, 25 mM HEPES, final pH 7.3-7.5) and ana- 
lysed by flow cytometry (LSR-II Analyser, BD Biosciences). The data 
were processed by FlowJo software. 


Flow cytometry analysis for two-colour Ribo-Halo labelling 

The corresponding Ribo-Halo cells were plated onto 24-well plates, and
100 nM of TMR-Halo ligand was added and incubated for 1 h to label the
pre-existing ribosomes, then washed away. The cells were incubated
with 2 ml of fresh DMEM for 10 min in the dark to remove the remaining
free TMR ligand. This was repeated three times in total. Then the
cells were grown either in rich medium or in Tor1 treatment medium,
both containing 50 nM R110 Halo ligand. In parallel, control Ribo-Halo
cells were stained only with TMR ligand or with R110 ligand, the signal
of which represented 100% ribosome abundance and was used as a
normalization factor. The cells were then collected after trypsin treatment,
resuspended in 250 µl FACS buffer (1x DPBS, 1 mM EDTA, 1% FBS,
25 mM HEPES, final pH 7.3-7.5), and analysed by flow cytometry (LSR-II
Analyser, BD Biosciences). The data were processed by FlowJo software.
In brief, the 488- and 561-nm intensities of individual cells (>3,000 cells)
were exported to Microsoft Excel. Each fluorescence signal from the
single cells was normalized by the signal intensity from the 100% red-
or 100% green-ligand-treated cells for the same duration. The old to
new ribosome ratio distribution graph was generated by dividing the
normalized red signal by the normalized green signal coming from
the same single cells and plotted using Prism software.
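
The per-cell normalization and ratio calculation described above can be sketched as follows. This is a minimal illustration, not the authors' spreadsheet: the function name is hypothetical, the intensity values are made up, and using the mean of the single-ligand control cells as the 100% reference is an assumption (the text does not say which summary statistic was used).

```python
from statistics import mean


def old_to_new_ratios(red, green, red_control, green_control):
    """Per-cell ratio of normalized red (old) to normalized green (new) signal.

    red, green: paired 561-nm and 488-nm intensities of the same single cells.
    red_control, green_control: intensities from cells labelled with only the
    red (TMR) or only the green (R110) ligand, taken as 100% (assumption: mean).
    """
    if len(red) != len(green):
        raise ValueError("red and green must be paired per cell")
    red_100 = mean(red_control)
    green_100 = mean(green_control)
    ratios = []
    for r, g in zip(red, green):
        if g <= 0:
            continue  # skip cells without measurable green signal
        ratios.append((r / red_100) / (g / green_100))
    return ratios


# Made-up intensities for two cells and their single-ligand controls.
print(old_to_new_ratios([50, 100], [200, 100], [100, 100], [200, 200]))
```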


In-gel fluorescence analysis for two-colour Ribo-Halo labelling

Related to Figs. 2, 4. The corresponding Ribo-Halo cells were plated
onto 12-well plates, and 100 nM of TMR-Halo ligand was added and
incubated for 1 h to label the pre-existing ribosomes, then washed away.
The cells were incubated with 2 ml of fresh DMEM for 10 min in the dark
to remove the remaining free TMR ligand. This was repeated three
times in total. Then the cells were grown either in rich medium or in
Tor1 treatment medium, both containing 50 nM R110 Halo ligand. In
parallel, control Ribo-Halo cells were stained only with TMR ligand
or with R110 ligand, the signal of which represented 100% ribosome
abundance and was used as a normalization factor. The cells were
then collected after trypsin treatment, resuspended in RIPA buffer
and sonicated three times on ice. Following Bradford assay, 15 µg of
the lysates were taken from each sample and resolved by SDS-PAGE
using a 4-12% NuPAGE Bis-Tris gel (Thermo Fisher Scientific). The in-gel
TMR and R110 signals were detected by ChemiDoc MP (Bio-Rad). The
gel was then either stained with Coomassie blue as a loading
control, or the proteins were transferred to a PVDF membrane for the
subsequent immunoblotting analysis. The ratio change of the TMR
and R110 signals was analysed using Image Lab software (Bio-Rad).


Measurement of cell size 

Related to Extended Data Figs. 2, 3. Cells were grown on a 24-well plate
with or without Tor1 (200 nM) for 14 h. Then, the cells were treated with
trypsin for 3 min, followed by the addition of 1 ml DMEM. The cells were
collected and spun down at 1,000 rpm for 3 min, and the DMEM was
removed. The cell pellet was resuspended in 500 µl of DMEM with or
without Tor1 (200 nM) at room temperature. A 75-µl quantity of each
sample (around 2.0 x 10° cells/ml) was injected to Moxi Cassette and 
the size measured by Moxi GO II. The single cell-size measurements 
were processed by FlowJo and plotted using Prism software. 


Confocal microscopy 

Related to Fig. 2. Live-cell confocal microscopy was performed based on
a previously reported method. In brief, cells were plated onto a 35-mm
glass-bottom dish (no. 1.5, 14-mm glass diameter, MatTek) pre-treated
with poly-L-lysine. Two days later, the cells were incubated in 100 nM
TMR-ligand-containing medium for 1 h. Following the washing steps
(three times for 10 min each), the cells were incubated with or without
Tor1 in the presence of 50 nM R110 Halo ligand for 14 or 24 h. The
medium was changed to FluoroBrite DMEM (Thermo Fisher Scientific)
before the live-cell imaging. The cells were imaged using a Yokogawa
CSU-X1 spinning disk confocal with Spectral Applied Research Aurora
Borealis modification on a Nikon Ti motorized microscope equipped
with a Nikon Plan Apo 60x/1.40 NA or 100x/1.40 NA objective lens.
Pairs of images for ratiometric analysis of TMR and R110 fluorescence
were collected sequentially using 100 mW 488-nm and 100 mW 561-nm
solid state lasers attenuated and controlled with an AOTF (Spectral
Applied Research LMM-5), and the emission was collected with a
525/50-nm or 620/60-nm filter (Chroma Technologies), respectively.
Wide-field fluorescence images of Hoechst were collected using a
Lumencor SOLA light source, 395/35-nm excitation and 480/40-nm
emission filters (Chroma Technologies). Both confocal and wide-field
images were acquired with the same Hamamatsu ORCA-ER cooled CCD
camera and MetaMorph software. For the analysis, the same gamma,
brightness and contrast were applied for each image using Fiji software.


AHA incorporation and TMR-in-gel fluorescence analysis for 
translatome analysis 

Related to Fig. 3. Wild-type 293T cells were plated onto a 10-cm dish
with around 20% confluency 24 h before the experiment. Commercial
DMEM medium was replaced with in-house prepared DMEM medium
containing either 250 µM Met or 250 µM AHA in addition to the indicated
nutrient stress medium. The cells were washed three times with
DPBS and lysed by the addition of RIPA buffer and sonication (4 °C,
three times). Following Bradford assay, 30 µg of lysate was subjected
to immunoblotting for pS6K, pS6 and 4E-BP1 as a quality control for
the proper inhibition of mTOR. From the remaining lysates, an equal
protein amount was collected across the samples, reduced by 5 mM
TCEP (10 min, room temperature), then alkylated by chloroacetamide
(20 mM final concentration, 15 min, room temperature). CHCl3/MeOH
precipitation was performed to remove any remaining AHA in the
cell lysates. The white protein disk was then resuspended in 2% SDS
(50 mM HEPES, 150 mM NaCl, pH 7.2, 2.5 mM TCEP) and sonicated once
with a tip sonicator. Ten microlitres of the resuspended lysate was
mixed with 10 µl of 2x Click-master mix containing 2 mM CuSO4, 2 mM
sodium ascorbate, 200 µM TBTA ligand and 200 µM TMR-alkyne in
H2O. The click reaction mixture was then incubated in the dark for
1 h (with rocking, room temperature; final sample concentration
0.825 µg/µl). The mixture was then resolved by SDS-PAGE, and the TMR
signal was detected by ChemiDoc MP (Bio-Rad) and analysed by Image Lab
software (Bio-Rad).
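
Because the 2x click master mix is combined 1:1 with lysate, every component ends up at half its stock concentration, and the stated final protein concentration implies the concentration of the input lysate. A minimal sketch of this arithmetic follows; the function and variable names are hypothetical, and only the concentrations and volumes come from the text.

```python
def final_concentrations(master_mix, lysate_ul, mix_ul):
    """Concentration of each master-mix component after adding it to lysate."""
    total_ul = lysate_ul + mix_ul
    if total_ul <= 0:
        raise ValueError("total volume must be positive")
    factor = mix_ul / total_ul  # share of the final volume that is master mix
    return {name: conc * factor for name, conc in master_mix.items()}


def input_lysate_ug_per_ul(final_ug_per_ul, lysate_ul, mix_ul):
    """Lysate protein concentration implied by the final sample concentration."""
    return final_ug_per_ul * (lysate_ul + mix_ul) / lysate_ul


# 2x master mix as described in the text.
CLICK_MIX_2X = {
    "CuSO4_mM": 2.0,
    "sodium_ascorbate_mM": 2.0,
    "TBTA_uM": 200.0,
    "TMR_alkyne_uM": 200.0,
}

print(final_concentrations(CLICK_MIX_2X, lysate_ul=10.0, mix_ul=10.0))
print(input_lysate_ug_per_ul(0.825, lysate_ul=10.0, mix_ul=10.0), "µg/µl input lysate")
```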


AHA incorporation, biotin-click and streptavidin enrichment 
for translatome analysis 

Related to Fig. 3. Wild-type 293T cells were plated onto a 10-cm dish
with around 20% confluency 24 h before the experiment. Commercial
DMEM medium was replaced with in-house prepared DMEM medium
containing either 250 µM Met or 250 µM AHA in addition to the indicated
nutrient stress. The cells were washed three times with DPBS
and lysed by the addition of RIPA buffer and sonication (three times).
Following Bradford assay, 30 µg of the lysate was subjected to
immunoblotting for pS6K, pS6 and 4E-BP1 as a quality control for the proper
inhibition of mTOR. We also confirmed that AHA addition (250 µM)
without prior Met withdrawal minimized the effects on mTOR activity
using pS6K band intensity. From the remaining lysates, an equal
protein amount was collected across the samples, then reduced by
5 mM TCEP (10 min, room temperature) and alkylated by chloroacetamide
(20 mM final concentration, 15 min, room temperature). CHCl3/MeOH
precipitation was performed to remove any remaining AHA in
the cell lysates. The white protein disk was then resuspended in 2%
SDS (50 mM HEPES, 150 mM NaCl, pH 7.2, 2.5 mM TCEP) and sonicated
once with a tip sonicator. At least 1 mg of the lysates resuspended in
2% SDS was taken and reacted with the click reagents (final concentration:
1 mM CuSO4, 1 mM sodium ascorbate, 100 µM TBTA ligand and
100 µM biotin-alkyne; 1% SDS), and the mixture was incubated for 2 h
with rocking at room temperature. This was followed by CHCl3/MeOH
precipitation again to remove excess biotin-alkyne, and the lysate was
resuspended with 200 µl of 2% SDS (TCEP 2.5 mM) and diluted with RIPA
to bring the lysate to a concentration of <1 mg/ml (final SDS concentration
<0.5%). Meanwhile, 10 µl of high-capacity streptavidin agarose beads
multiplied by the number of samples (for example, 100 µl for 10 samples;
the beads have a 10 mg/ml BSA-biotin capture capacity, so
10 µl of beads can capture roughly 100 µg of biotinylated proteins
when the average molecular weight is assumed to be 66 kDa) were
washed with 1 ml RIPA and distributed to 10 x 1.5 ml Eppendorf tubes.
The lysates were added to the beads and incubated overnight at room
temperature. Flow-through was then stored for quality control,
and the beads were washed with RIPA x2, 1 M KCl, 0.1 M Na2CO3, 2 M
urea in HEPES buffer x2 and RIPA x2. The beads were transferred onto
hydrophilic PTFE membrane filter cups and washed with water twice.
Completely dried beads were then resuspended in 50 µl of
hexafluoroisopropanol (HFIP), incubated for 5 min with shaking and eluted by
spinning. Fifty microlitres of HFIP was added to the beads again, and the
eluates were combined and dried in a speedvac for 10 min (HFIP is
volatile, with a boiling point of 58 °C). This sample was used for either western
blotting or TMT-MS analysis. Note: as a quality control, we gathered
the flow-through and blotted it with streptavidin-IRDye800 to check for equal
capture. Also, the flow-through was precipitated with the CHCl3/MeOH
method followed by click reaction with TMR-alkyne (an orthogonal
alkyne reagent to the biotin-alkyne) to check the efficiency of the click
reaction. On the basis of this method, we concluded that the efficiency
of the initial click reaction was always over 90% when compared with
the negative control that does not contain AHA.
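
The bead-volume rule of thumb given above (10 µl of beads per sample, with a stated capture capacity of 10 mg/ml) can be written out as a short calculation. The function names below are hypothetical; the default values are those quoted in the text, and the capacity figure refers to BSA-biotin, so capture of other biotinylated proteins is only approximate.

```python
def bead_volume_ul(n_samples, ul_per_sample=10.0):
    """Total bead volume (µl) to wash for a given number of samples."""
    if n_samples < 0:
        raise ValueError("n_samples must be non-negative")
    return n_samples * ul_per_sample


def capture_capacity_ug(bead_ul, capacity_mg_per_ml=10.0):
    """Approximate protein (µg) captured by a bead volume.

    1 mg/ml equals 1 µg/µl, so capacity in mg/ml times volume in µl gives µg.
    """
    return bead_ul * capacity_mg_per_ml


print(bead_volume_ul(10), "µl of beads for 10 samples")
print(capture_capacity_ug(10.0), "µg captured per 10 µl of beads")
```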


AHA incorporation and TMT-MS sample preparation for 
degradome analysis 

Related to Fig. 3. Equal numbers of the wild-type 293T cells were
plated onto a 10-cm dish with around 20% confluency 24 h before the
experiment. Commercial DMEM medium was replaced with in-house
prepared DMEM medium containing either 250 µM Met or 250 µM
AHA and the cells were incubated for 5 h for AHA incorporation. The
AHA-containing medium was removed, and the cells were washed with
rich DMEM. The cells were incubated in the rich DMEM supplemented
with 10% FBS for 1 h for the degradation of short-lived proteins with
AHA incorporation. Then the control cells were collected immediately
by addition of the RIPA buffer (containing 0.5% SDS instead of 0.1%),
whereas the untreated and nutrient-stress samples were further grown
in rich medium or Tor1 (200 nM)-containing medium for 12 h. To avoid
the dilution factor of the AHA-labelled proteome through cell division,
we collected the lysates from the entire dish and processed them for
quality control immunoblotting without normalizing the total proteome
level using Bradford assay. The remaining lysates were reduced by
5 mM TCEP (10 min, room temperature), alkylated by chloroacetamide
(20 mM final concentration, 15 min, room temperature), and proteins
were precipitated by the CHCl3/MeOH method. The white protein disk
was then resuspended in 2% SDS (50 mM HEPES, 150 mM NaCl, pH 7.2,
2.5 mM TCEP) and sonicated once with a tip sonicator. A small aliquot
of the lysates was reacted with TMR alkyne for a quality check, and the
rest was clicked with biotin alkyne followed by streptavidin capture. In
this case, the beads were washed with more stringent buffers (RIPA x2,
2% SDS x2, 3 M urea x2, 0.1 M Na2CO3, RIPA x2, ddH2O x3).


Preparation of nuclear-cytosolic partitioning samples for
immunoblotting and TMT-MS analysis 

Related to Extended Data Fig. 7. The final protocol used in this study 
(Method 3 in Extended Data Fig. 7b) was modified from a previous 
study”. HEK 293T cells were plated onto 10-cm dishes with around 
20% confluency 24 h before the experiment. The cells were either 
left untreated or treated with -AA medium for three hours, then 
washed three times with DPBS. Then the cells were lysed by adding 
800 µl of the lysis buffer (50 mM HEPES pH 7.2, 150 mM NaCl, 0.1%
NP-40, 10 mM glycerophosphate, 10 mM sodium biphosphate, protease
inhibitor cocktail, 2.5 mM MgCl₂) directly on to the dish, and
the lysates were collected by scraping and pipetted up and down 6 
times. Over 90% of the cells showed cell-membrane rupture with an 
intact nucleus when observed under the microscope after Trypan 
blue staining. Fifty microlitres was taken and snap-frozen using liquid 
nitrogen (whole-cell lysate). The rest of the lysate was spun down at 
7,000 rpm for 30 s and the supernatant taken and snap-frozen (cyto- 
solic fraction). The remaining pellet was washed with 1 ml DPBS and 
spun again at 7,000 rpm for 30s, and the supernatant was removed 
(the pellet is the nuclear fraction). The nuclear-fraction pellet was 
resuspended in 300 µl RIPA buffer containing benzonase and sonicated
three times. The cytosolic fraction was supplemented with 80 µl
of 10x RIPA buffer to equalize the surfactant concentration as in the
nuclear fraction, and sonicated. Bradford assay was performed using 
the whole-cell lysate, cytosolic fractions and the nuclear fraction. 
One hundred micrograms of the proteins from both nuclear and 
cytosolic fractions were taken and processed for MS analysis, and 15
µg of each fraction was taken and subjected to immunoblotting as
quality control. This method resulted in around 3.2-3.6 times more 
loading of the nuclear fractions compared to the cytosolic fractions 
when calculated back to the total protein level. This ratio was applied 
for the final calculation of the nuclear-cytosolic protein abundance 
in the TMT-MS analysis. 
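
The back-calculation described above can be illustrated with a minimal Python sketch. It is not the authors' analysis code: the function names, the example Bradford totals and the reading of the 3.2-3.6 ratio as a nuclear over-loading factor that divides the nuclear signal are all assumptions made for illustration.

```python
def overload_factor(total_nuclear_ug, total_cytosolic_ug,
                    loaded_nuclear_ug=100.0, loaded_cytosolic_ug=100.0):
    """How many times more of the nuclear fraction than of the cytosolic
    fraction was loaded, relative to each fraction's total protein."""
    amounts = (total_nuclear_ug, total_cytosolic_ug,
               loaded_nuclear_ug, loaded_cytosolic_ug)
    if min(amounts) <= 0:
        raise ValueError("all protein amounts must be positive")
    nuclear_share_loaded = loaded_nuclear_ug / total_nuclear_ug
    cytosolic_share_loaded = loaded_cytosolic_ug / total_cytosolic_ug
    return nuclear_share_loaded / cytosolic_share_loaded


def nuclear_fraction(nuclear_signal, cytosolic_signal, factor):
    """Share of a protein's cellular pool that is nuclear, after dividing
    the nuclear signal by the over-loading factor."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    corrected_nuclear = nuclear_signal / factor
    total = corrected_nuclear + cytosolic_signal
    if total <= 0:
        raise ValueError("no signal for this protein")
    return corrected_nuclear / total


# Hypothetical Bradford totals (in µg), chosen so that the factor falls
# inside the 3.2-3.6 range reported in the text.
factor = overload_factor(total_nuclear_ug=500.0, total_cytosolic_ug=1700.0)
share = nuclear_fraction(nuclear_signal=340.0, cytosolic_signal=100.0,
                         factor=factor)
print(round(factor, 2), round(share, 2))
```

With equal masses loaded from both fractions, the factor reduces to the ratio of total cytosolic to total nuclear protein, which is why a protein with a raw nuclear signal several times its cytosolic signal can still be evenly split between the two compartments.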

Methods in Extended Data Fig. 7b: Method 1: (a) Buffer: 1% Triton X,
10 mM β-glycerophosphate, 10 mM sodium pyrophosphate, 40 mM
HEPES pH 7.4, 2.5 mM MgCl₂, protease inhibitor cocktail. (b) Lysis
method: the cell lysates were incubated for 15 min at 4 °C with rocking,
followed by centrifugation for 5 min at 13,000 rpm. The pellet was
washed with DPBS once and resuspended in RIPA buffer for nuclear
fraction. (c) Total time, around 25 min.
Method 2: (a) Buffer: 0.1% Triton X,
10 mM β-glycerophosphate, 10 mM sodium pyrophosphate, 40 mM
HEPES pH 7.4, 2.5 mM MgCl₂, protease inhibitor cocktail. (b) Lysis
method: the cell lysates were incubated for 15 min at 4 °C with rocking,
followed by centrifugation for 5 min at 13,000 rpm. The pellet was
washed with DPBS once and resuspended in RIPA buffer for nuclear
fraction. (c) Total time, around 25 min.
Method 3: (a) Buffer: 0.1% NP40
in DPBS (1 mM KH₂PO₄, 150 mM sodium chloride, 5.6 mM Na₂HPO₄,
protease inhibitor cocktail, pH 7.3-7.5). (b) Lysis method: the lysates
gathered in the lysis buffer were pipetted three times and centrifuged
at 13,000 rpm for 10 s. The pellet was washed with DPBS once and
resuspended in RIPA buffer for nuclear fraction. (c) Total time, around 3
min.
Method 4: (a) Buffer: 0.05% NP40, 10 mM HEPES, 1.5 mM MgCl₂,
10 mM KCl, 0.5 mM DTT, pH around 7.3. (b) Lysis method: the cells were
collected in the lysis buffer and sat on ice for 10 min for osmotic cell
lysis, followed by centrifugation at 3,000 rpm for 10 min, 4 °C. The
supernatant was snap-frozen while the pellet was washed with the
lysis buffer, and centrifuged for 1 min at 3,000 rpm. The pellet was
resuspended in RIPA buffer for nuclear fraction. (c) Total time, around
25 min.


Proteomics workflow 
An extensive description of proteomics methods and detailed parameters
is included in the first sheet of each Supplementary Table.


Sample preparation. Samples were reduced and alkylated, 
chloroform-methanol precipitated, reconstituted in 100 mM EPPS 
(pH 8.5) and digested by Lys-C and then by trypsin. Samples were 
TMT-labelled for 60 min at room temperature. After a labelling efficien- 
cy check, samples were quenched, pooled and desalted for subsequent 
LC-MS/MS analysis. When indicated, pooled sample was first offline 
fractionated with basic pH reversed-phase liquid chromatography in a
96-well plate and combined for a total of 24 fractions® before desalting 
and subsequent liquid chromatography-tandem mass spectrometry 
(LC-MS/MS) analysis. 


Data acquisition. Samples were analysed on Orbitrap Tribrid Se- 
ries mass spectrometers coupled to a Proxeon EASY-nLC pump 
(Thermo Fisher Scientific). Peptides were separated on a 35-cm 
column packed in-house using a 95 to 110 min gradient. MS1 data were
collected using the Orbitrap (120,000 resolution). MS2 scans were
performed in the ion trap with CID fragmentation (isolation window
0.7 Da; rapid scan; NCE 35%). Each analysis used the Multi-Notch
MS3-based TMT method™, to reduce ion interference compared to
MS2 quantification, combined in some instances with newly implemented
Real Time Search analysis software®”*, and with the FAIMS
Pro Interface (using previously optimized 3 CV parameters for TMT
multiplexed samples*’). MS3 scans were collected in the Orbitrap
using a resolution of 50,000 and NCE of 65 (TMT) or 45 (TMTpro)
for HCD fragmentation. The closeout was set at two peptides per
protein per fraction, so that MS3 scans were no longer collected for
proteins having two peptide-spectrum matches (PSMs) that passed
quality filters*°.
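
The closeout rule can be pictured as a per-fraction counter. The sketch below is only an illustration of that logic in Python, not the instrument's Real Time Search implementation; the class and method names are hypothetical.

```python
from collections import defaultdict


class Closeout:
    """Stop requesting MS3 scans for a protein once enough of its PSMs
    have passed the quality filters within the current fraction."""

    def __init__(self, max_psms_per_protein=2):
        self.max_psms = max_psms_per_protein
        self._passing = defaultdict(int)  # protein -> passing PSM count

    def start_fraction(self):
        """Reset the counters: the closeout applies per fraction."""
        self._passing.clear()

    def should_collect_ms3(self, protein):
        """True while the protein is still below the closeout limit."""
        return self._passing[protein] < self.max_psms

    def record_passing_psm(self, protein):
        """Register one PSM for this protein that passed quality filters."""
        self._passing[protein] += 1


closeout = Closeout(max_psms_per_protein=2)
closeout.start_fraction()
for _ in range(2):
    closeout.record_passing_psm("RPS3")
print(closeout.should_collect_ms3("RPS3"), closeout.should_collect_ms3("RPL29"))
```

The point of the rule is to spend MS3 scan time on proteins that have not yet been quantified rather than on abundant proteins that already have two usable measurements in that fraction.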


Data analysis. Mass spectra were processed using a Comet-based 
(v.2018.01 rev.2) in-house software pipeline*®”’ or Sequest-HT using 
Proteome Discoverer (v.2.3.0.420 - Thermo Fisher Scientific). Database 
searching included all canonical entries from the human reference proteome
database (UniProt Swiss-Prot - 2019-01) and sequences of common
contaminant proteins. Searches were performed using a 20 ppm
precursor ion tolerance, and recommended product ion parameters 
for ion trap MS/MS were used. TMT tags on lysine residues and peptide 
N termini (+229.163 Da for amino-TMT or +304.2071 Da for TMTpro) 
and carbamidomethylation of cysteine residues (+57.021 Da) were set 
as static modifications, and oxidation of methionine residues (+15.995 
Da) was set as a variable modification. For phosphopeptide analysis, 
+79.9663 Da was set as a variable modification on serine, threonine 
and tyrosine residues. PSMs were filtered to a 1% false discovery rate 
(FDR) using linear discriminant analysis as described previously*. 
Using the Picked FDR method”, proteins were filtered to the target 
1% FDR level. Phosphorylation site localization was determined using 
the AScore algorithm“. For reporter ion quantification, a 0.003 Da 
window around the theoretical m/z of each reporter ion was scanned, 
and the most intense m/z was used. Reporter ion intensities were ad- 
justed to correct for the isotopic impurities. Peptides were filtered to 
include only those peptides with a sufficient summed signal-to-noise 
ratio across all TMT channels. An isolation purity of at least 0.7 (70%) 
in the MS1 isolation window was used for samples analysed without
online real-time searching. For each protein, the filtered peptide TMT 
or TMTpro signal-to-noise values were summed to create protein quan- 
tification values*®. 
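
The reporter ion extraction and protein-level summation described above can be sketched in Python as follows. This is an illustrative outline rather than the in-house pipeline: the function names and record layout are hypothetical, the 0.003 Da window is interpreted here as a tolerance of ±0.003 Da (an assumption), and the summed signal-to-noise threshold is a free parameter because the text only says "sufficient".

```python
def pick_reporter_peak(peaks, theoretical_mz, half_width_da=0.003):
    """Return the most intense (m/z, intensity) peak within the window
    around a reporter ion's theoretical m/z, or None if there is none."""
    in_window = [p for p in peaks if abs(p[0] - theoretical_mz) <= half_width_da]
    return max(in_window, key=lambda p: p[1]) if in_window else None


def summarize_proteins(psms, min_summed_sn, min_purity=None):
    """Sum per-channel signal-to-noise values of the PSMs that pass the
    filters, grouped by protein.

    Each PSM is a dict with keys "protein", "sn" (one value per TMT
    channel) and "purity" (isolation purity between 0 and 1).
    Pass min_purity=None to skip the purity filter.
    """
    totals = {}
    for psm in psms:
        if min_purity is not None and psm["purity"] < min_purity:
            continue
        if sum(psm["sn"]) < min_summed_sn:
            continue
        channels = totals.setdefault(psm["protein"], [0.0] * len(psm["sn"]))
        for i, value in enumerate(psm["sn"]):
            channels[i] += value
    return totals


# Toy spectrum around an example reporter m/z; all values are illustrative.
peaks = [(126.1270, 50.0), (126.1278, 200.0), (126.1400, 999.0)]
best = pick_reporter_peak(peaks, theoretical_mz=126.127726)

psms = [
    {"protein": "P1", "sn": [100.0, 200.0], "purity": 0.9},
    {"protein": "P1", "sn": [50.0, 50.0], "purity": 0.8},
    {"protein": "P1", "sn": [500.0, 500.0], "purity": 0.5},  # fails purity
    {"protein": "P2", "sn": [1.0, 2.0], "purity": 0.95},     # fails summed S/N
]
protein_quant = summarize_proteins(psms, min_summed_sn=100.0, min_purity=0.7)
print(best, protein_quant)
```

Summing signal-to-noise values rather than averaging ratios weights each peptide by how well it was measured, so a single noisy PSM contributes little to the protein value.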

Protein quantification values were exported for further analysis 
in Microsoft Excel and Perseus” and statistical tests and parameters 
used are indicated in the corresponding Supplementary Tables. In 
brief, two-sided Welch’s t-test analysis was performed to compare two
datasets, using the s0 parameter (in essence a minimum fold-change
cut-off), and correction for multiple comparisons was achieved by the
permutation-based FDR method; both functions are built into the
Perseus software. For whole-cell proteome analysis, each reporter ion
channel was summed across all quantified proteins and normalized 
assuming equal protein loading of all samples. For the AHA-derived 
translatome, as well as nuclear-cytosolic fractionation, no normaliza- 
tion based on loading was performed. For AHA-derived degradome 
normalization, we used a previously reported strategy”, which is 
based on the assumption that there are stable proteins within the 
pool of proteins measured, the amounts of which decay very little 
during the time course of the experiment. 
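
As a rough illustration of two of the steps above, the following Python sketch shows column-sum normalization under the equal-loading assumption and a Welch statistic with an optional s0 term. It does not reproduce Perseus: the function names are hypothetical, the s0 term is added to the denominator in the SAM style (an assumption about the form), inputs to the test are assumed to be log-transformed abundances, and neither the P value nor the permutation-based FDR is computed.

```python
from math import sqrt
from statistics import mean, variance


def normalize_equal_loading(matrix):
    """Scale each channel (column) so that all channel sums are equal.

    matrix is a list of rows, one row per protein, one value per channel.
    """
    n_channels = len(matrix[0])
    col_sums = [sum(row[j] for row in matrix) for j in range(n_channels)]
    target = sum(col_sums) / n_channels
    factors = [target / s for s in col_sums]
    return [[v * f for v, f in zip(row, factors)] for row in matrix]


def welch_statistic(a, b, s0=0.0):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    s0 is added to the standard error, which damps the statistic for
    proteins with very small variance and small differences.
    """
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / (sqrt(va + vb) + s0)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


normalized = normalize_equal_loading([[10.0, 20.0], [30.0, 60.0]])
t_stat, dof = welch_statistic([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(normalized, round(t_stat, 3), round(dof, 3))
```

Column-sum normalization is only appropriate when equal protein loading can be assumed, which is why the text applies it to the whole-cell proteome but not to the AHA-derived translatome or the nuclear-cytosolic fractions.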

Supplementary Tables list all quantified proteins as well as associated 
TMT reporter intensity and ratio change to control channels used for 
quantitative analysis. 

Annotations for bona fide organellar protein markers were assembled
using the proteins that had scored with confidence ‘very high’ or
‘high’ from the previously published HeLa dataset**. Annotations for 
ribosome and autophagy components were manually curated and 
assembled based on the literature. Nuclear annotations are based on 
the dataset for this paper (Supplementary Table 4). 


Statistics and reproducibility 

All statistical data were calculated using GraphPad Prism 7 or Perseus. 
Comparisons of data were performed by two-way ANOVA with Tukey’s 
multiple comparisons test; P values of less than 0.01 were considered
significant. All experiments were repeated at least three times unless 
otherwise indicated. 


Data reporting 

No statistical methods were used to predetermine sample size. The 
experiments were not randomized, and the investigators were not 
blinded to allocation during experiments and outcome assessment. 


Reporting summary 


Further information on research design is available in the Nature 
Research Reporting Summary linked to this paper. 


Data availability 


All the mass spectrometry proteomics data have been deposited to the
ProteomeXchange Consortium via the PRIDE repository
(http://www.proteomexchange.org/): dataset 1 (related to Supplementary Table 1;
PXD017852, PXD017853); dataset 2 (related to Supplementary Table 2;
PXD018252); dataset 3 (related to Supplementary Table 3; PXD017857,
PXD018158); dataset 4 (related to Supplementary Table 4; PXD017856,
PXD017855); dataset 5 (related to Supplementary Table 5; PXD017858,
PXD017851); dataset 6 (related to Supplementary Table 6; PXD017861,
PXD017860, PXD017859). Source data are provided with this paper.
Full gel data for immunoblots are provided in Supplementary Fig. 1. All
datasets generated within this study are available online, and the
reagents are available from the corresponding author on request.

Extended Data Fig. 1 | Reduction of r-proteins during nutrient stress is not
discernible by immunoblotting methods regardless of autophagy. a,
Schematic of the ribosome analysis pipeline. b-d, (top panels) Volcano plots
(-Log₁₀ p-value versus Log₂ ratio Tor1/UT) for 293T cells (n = 8029 proteins),
ATG7-/- (n = 8373 proteins) or RB1CC1-/- (n = 8332 proteins) (b), HEK293 cells
(n = 7531 proteins), ATG5-/- (n = 7504 proteins) (c), or HCT116 cells (n = 3779
proteins), ATG5-/- (n = 3761 proteins) or RB1CC1-/- (n = 3671 proteins) (d). n = 3
(UT); 4 (Tor1) biologically independent samples. P values were calculated by
two-sided Welch’s t-test (adjusted for multiple comparisons); for parameters,
individual P values and q values, see Supplementary Table 1. Green dots
represent r-proteins. Data for 293T cells are from ref. ”. Histograms below the
individual volcano plots show the mean ± s.d. of relative abundance of
autophagy adaptors with or without nutrient deprivation. n = 3 (UT) or n = 4
(-AA and Tor1) biologically independent samples. U, untreated; A, -AA; T, Tor1.
Data for 293T cells are from ref. 2. e, The relative abundance changes” for
proteins located in the ER, Golgi, or the ribosome in 293T cells treated as in b
are plotted as a violin plot (n = 340, 349, 343, 340, 349, 343, 87, 89, 86, 87, 89, 86,
72, 75, 70, 72, 75, and 70 proteins, from left to right). R-protein abundance
change is not affected by autophagy unlike other vesicular organelles. The
violin curves represent the distribution and density of the indicated dataset
(Centre-line: median; Limits: minima and maxima). f, Plots of relative
abundance of individual r-proteins in HEK293 cells upon either 10 h of amino
acid withdrawal (left) or Tor1 treatment (right). 39 r-proteins with less
than ±10% error range for every condition. Mean ± s.d. for n = 3 (UT) or n = 4
(-AA and Tor1) biologically independent samples. g, 293T cells with or without
ATG7 or RB1CC1 were either left untreated, subjected to amino acid withdrawal
(10 h) or treated with Tor1 (10 h) and whole cell extracts immunoblotted for the
indicated proteins. h, HCT116 cells with or without ATG5 or RB1CC1 were
treated as indicated, and whole cell extracts immunoblotted for the indicated
proteins. i-k, Extracts from the indicated cells (30, 15, 7.5 or 3.75 µg) were
immunoblotted with the indicated antibodies (i, j). The signal intensity for the
indicated r-proteins as a function of quantity loaded was measured using
Odyssey (k), showing no indication of signal saturation and no detectable
difference between cells with or without active autophagy. Related to Fig. 1.
The experiments shown in g-j were repeated more than three times
independently and showed similar results. For gel source data, see
Supplementary Fig. 1.

Extended Data Fig. 2 | Generation of Ribo-Halo reporters and extensive
titration assays for quantification. a, The Halotag7 protein (referred to here
as ‘Halo’) was endogenously tagged at the C terminus of RPL29 or RPS3 as the
indicated r-proteins contain solvent-exposed C termini and are located far
from the peptide exit tunnel based on the structure of an 80S complex (PDB:
5AJ0). b, c, Gene editing of HCT116, HEK293 and 293T cells using CRISPR-Cas9
to fuse Halo with the C termini of RPS3 and RPL29. Homozygous incorporation
of Halo was confirmed by genotyping (b). Extracts from the indicated cells
were subjected to immunoblotting with RPS3 (c, left) or RPL29 (c, right).
Protein translation efficiency of wild-type and Halo knock-in cells was
compared using puromycin incorporation assay (c, bottom).
d, Immunoblotting of WT or RPS3-Halo 293T cell lysates after nutrient stress,
confirming no detectable difference between the two cell lines in response to
mTOR inhibition. The full immunoblot is shown in Extended Data Fig. 9i.
e-g, Halo-ligand titration assays with the indicated incubation time were
performed using flow cytometry analysis (e, f) or in-gel fluorescence analysis
(g) for the labelling saturation. In f, background signal from free Halo-ligand
was measured using WT HCT116 cells in comparison to RPS3-Halo, confirming
that the free ligand does not contribute to the observed fluorescence signal.
h, HCT116 RPS3-Halo cells were incubated with 250 nM Halo-TMR ligand for 1 h,
washed the indicated number of times by incubating cells in ligand-free medium
for 20 min each time, followed by a 17-h prolonged incubation in ligand-free
medium. i, Extracts from the RPS3-Halo HCT116 cells (20, 15, 10 or 5 µg) treated
with the indicated Halo ligands were subjected to in-gel fluorescence analysis.
The fluorescence signal intensity of each lane was directly proportional to the
loading amount. We noted that the R110 fluorophore was excited by epi-green
excitation (520-545 nm) and detected at 577-613 nm. We subtracted this
bleed-through signal for TMR quantification. j, Measurement of the effect of
Tor1 on cell size with HCT116 RPS3-Halo (left) and RPL29-Halo (right) using
Coulter Principle-based cell measurements. k, Cell proliferation assay. HCT116
RPS3-Halo and RPL29-Halo cells were grown in rich medium or Tor1 (200 nM)-
containing medium for 12, 16 and 24 h. The data indicate a cell division time
of around 16 h for untreated cells and around 24 h for Tor1-treated cells.
Mean ± s.d. for n = 4 biologically independent experiments.
l, Ratio of pre-existing to newly synthesized RPL29-Halo per cell plotted
against cell populations as a frequency histogram (left). Average from the
triplicate experiments plotted as a bar graph (right). Pre-existing Ribo-Halo
proteins in HCT116 RPL29-Halo cells were labelled with TMR ligand (100 nM,
1 h), followed by thorough washing and addition of 50 nM Green-ligand (also
called R110-ligand). The newly synthesized RPL29-Halo was chased for 8, 16
and 24 h before flow cytometry analysis. Error bars represent s.d. See Methods
for details. m, Pre-existing RPL29-Halo proteins were labelled with TMR ligand
(100 nM, 1 h) in HCT116 cells, and the newly synthesized RPL29-Halo in the
presence or absence of Tor1 (200 nM) was labelled with Green-ligand. The
ratio of R110 to TMR signals plotted against cell populations (left), and the
mean ± s.d. values from the triplicate experiments of 8, 16 and 24-h pulse chase
plotted as a bar graph (right) are shown. n, In-gel fluorescence images of the
cell extracts treated as in m. The same gels were then transferred to PVDF
membranes for immunoblotting measurement of total RPL29 level. n = 3
biologically independent samples. o, In-gel fluorescence images of the cell
extracts from 293T RPS3-Halo or HCT116 RPS3-Halo cells using the labelling
strategy in m. Relative synthesis of RPS3-Halo with or without Tor1 is plotted on
the right. Mean for n = 2 experiments. p, Live-cell imaging of HCT116 RPS3-Halo
cells labelled with TMR (for pre-existing r-proteins) and Green (for newly
synthesized r-proteins) ligands with or without Tor1 (200 nM, 24 h). Scale bar,
20 µm. q, Live-cell imaging of HCT116 RPL29-Halo cells labelled with TMR (for
pre-existing r-proteins) and Green (for newly synthesized r-proteins) ligands
with or without Tor1 (200 nM, 14 h). Scale bar, 20 µm. r, The indicated cells were
left untreated or incubated with Tor1 for 8 h before immunoblotting with the
indicated antibodies. Related to Fig. 2. The experiments in d, j and p-r were
repeated three times independently with similar results, and b, c and e-i were
performed once. For gel source data, see Supplementary Fig. 1.

Extended Data Fig. 3 | Minimal contribution of ribophagy to control of
r-protein synthesis and dilution by cell division in response to nutrient
stress. a, b, Histogram of normalized TMR signal in RPS3-Halo (a) and
RPL29-Halo (b) HCT116 cells with or without ATG8 conjugation (ATG7 for RPS3-Halo
and ATG5 for RPL29-Halo) or RB1CC1 incubated with or without 200 nM Tor1 for
14 h, followed by 1 h TMR ligand treatment and flow cytometry analysis. >3x10°
and >4x10° cells were analysed, respectively. c, d, Mean ± s.d. of the triplicate
data from cells treated as in a, b are plotted, respectively. e, Effect of Tor1
treatment on cell size in HCT116 RPL29-Halo WT, ATG5-/- and RB1CC1-/- cells, as
measured using Coulter Principle-based cell measurements. Mean ± s.d. of the
triplicate data. f, g, Ratio of pre-existing (red) to newly synthesized (green)
r-proteins per cell plotted against cell populations as a frequency histogram for
RPS3-Halo (f) or RPL29-Halo (g) HCT116 cells with or without ATG5, ATG7 or
RB1CC1 based on the labelling scheme in Fig. 2f. h, Quantification of relative
amounts of pre-existing and newly synthesized r-proteins from data in f, g.
Mean ± s.d., n = 3 biologically independent experiments. i, j, HCT116 RPS3-Halo
cells with or without ATG7 or RB1CC1 were left untreated or treated with Tor1
for 14 h (i) and 24 h (j) using the Halo tagging scheme in Fig. 2f. Extracts were
subjected to SDS-PAGE and in-gel fluorescence analysis, followed by
immunoblotting with the indicated antibodies. k, l, Quantification of relative
amounts of pre-existing and newly synthesized r-proteins from data in i, j.
Mean ± s.d., n = 3 biologically independent experiments. m, Live-cell imaging of
HCT116 RPL29-Halo cells with indicated genotypes labelled with TMR (for
pre-existing r-proteins) and Green (for newly synthesized r-proteins) ligands with or
without Tor1 (200 nM, 14 h). Scale bar, 20 µm. n, An example of gating strategy
used for flow cytometry analysis. Green and Red only control experiments are
shown at the bottom. Experiments in m, n were repeated more than three times
independently with similar results. For gel source data, see Supplementary
Fig. 1.

Extended Data Fig. 4 | Global decoding of protein translation during
nutrient stress via independent AHA-TMT methods. a, b, Total AHA
incorporation levels with or without prior methionine starvation (30 min) were
compared using 293T cells grown in Met or AHA (250 µM) for the indicated
duration, followed by click with TMR alkyne and in-gel fluorescence analysis
(a). The quantification of duplicate experiments is shown in b. c, d, 293T cells
were grown in medium with Met (250 µM) or the indicated concentration of
AHA for the indicated time periods. Extracts were clicked with TMR and
subjected to SDS-PAGE before in-gel TMR fluorescence analysis (c). TMR
intensity was quantified in d. e, f, 293T cells were grown in AHA (250 µM) with or
without Tor1 for the indicated time periods and extracts clicked with TMR
before processing as in c. The effect of Tor1 on TMR fluorescence is quantified
in f. g, 293T cells were incubated with or without amino acid withdrawal or
Tor1-containing medium in the presence of Met or AHA (250 µM each), as in Fig. 3c.
Cell extracts were subjected to SDS-PAGE followed by immunoblotting. Three
biologically independent samples are shown in the same blot. h, TMR signals in
Fig. 3c were quantified as described in Methods and plotted in the top panel, and
the relative TMT signal of the total biotinylated proteome in Fig. 3d is plotted
in the bottom panel. Centre data are mean ± s.d. n = 1, 3, 3 and 3 biologically
independent samples, from left to right for top and bottom panels.
i, Translatome analysis. 293T cells were carried through the workflow in Fig. 3b
with amino acid withdrawal and extracts clicked with biotin before enrichment
on streptavidin and TMT-based proteomics. Plot of -Log₁₀ p-value versus Log₂
ratio -AA/untreated is shown for n = 8285 proteins. n = 3 (UT; -AA) biologically
independent samples. j, Translatome plot of -Log₁₀ p-value versus Log₂ ratio
-AA/Tor1 is shown (n = 8285 proteins). The skew of r-proteins to the right side of the
volcano plot indicates that Tor1 suppresses the translation of r-proteins more
strongly than amino acid withdrawal, unlike the majority of the proteome. n = 3
(-AA; Tor1) biologically independent samples. P values in i and j were calculated
by two-sided Welch’s t-test (adjusted for multiple comparisons); for parameters,
individual P values and q values, see Supplementary Table 2. k-o, The relative
abundance of all quantified biotinylated proteins is shown in k. Proteins in
individual groups are shown in l-o. l: representative proteins showing over
twofold more reduction in translation than the average translation in both Tor1
and -AA conditions; m: proteins with over twofold less reduction in translation
than the average translation in both Tor1 and -AA conditions; n: proteins
showing twofold less reduction only in the Tor1 condition, with a p-value
(Supplementary Table 2) between Tor1 and -AA of less than 0.05; o: proteins
showing twofold less reduction only in the -AA condition, with a p-value
(Supplementary Table 2) between Tor1 and -AA of less than 0.05. n = 3
biologically independent samples per condition (UT; -AA; Tor1). Centre data
are mean ± s.e.m. p-r, 293T cells lacking either ATG7 or RB1CC1 were subjected
to amino acid withdrawal or Tor1 treatment for 3 h in the presence of 250 µM
Met or AHA. Cell extracts were subjected to SDS-PAGE followed by
immunoblotting with the indicated antibodies (p) or clicked with TMR and
in-gel fluorescence analysis (q). TMR signals were quantified as described in
Methods (r). Centre data are mean ± s.d. n = 1, 3, 3 and 3 biologically
independent samples, from left to right. s, Correlation plot (Log₂ ratio of
treated/untreated) for the translatome upon either Tor1 treatment or amino
acid withdrawal. Related to Fig. 3. Experiments in c-f were performed once and
p was performed three times independently with similar results. For gel source
data, see Supplementary Fig. 1.

Extended Data Fig. 5 | Global decoding of protein degradation during
nutrient stress via independent AHA-TMT methods. a, Extracts from cells as
described in Fig. 3g were subjected to immunoblotting with the indicated
antibodies to demonstrate suppression of the mTOR activity by Tor1.
b, Biotinylated extracts as described in Fig. 3g were subjected to
immunoblotting with a fluorescent streptavidin conjugate, showing the
pre-existing proteome (Streptavidin-IRdye) versus the total proteome (Revert total
protein stain) in the lysates. n = 3 biologically independent samples are shown
in a and b. c, Volcano plots (-Log₁₀ p-value versus Log₂ Tor1/UT 12 h) for 293T WT
(n = 8304 proteins), ATG7-/- (n = 8319 proteins), and RB1CC1-/- (n = 8590 proteins)
cells as in Fig. 3i, but with pattern 1 proteins in Fig. 3h coloured as red dots,
pattern 2 proteins coloured as green dots, and pattern 3 proteins coloured as
blue dots. P values were calculated by two-sided Welch’s t-test (adjusted for
multiple comparisons); for parameters, individual P values and q values, see
Supplementary Table 3. n = 3 (UT; Tor1) biologically independent samples.
d, Volcano plots as in c, but with autophagy-related proteins coloured as
indicated. Autophagy adaptors dramatically degraded upon Tor1 treatment
only in WT cells are shown in the circle. e, Plots of individual ratio for
AHA-labelled r-proteins using the protocol in Fig. 3g in WT, ATG7-/-, or RB1CC1-/- 293T
cells. r-proteins with STDEV < 0.3 for every condition are selected (n = 58
r-proteins). Centre data are mean ± s.d. from 3 biologically independent
samples for each condition. f, Relative turnover rates for individual r-proteins
from e, with or without mTOR inhibition, have a correlation of R² ~ 0.7 (n = 58
r-proteins). Grey dotted lines are 95% confidence intervals of the best-fit line
(solid black line) resulting from a simple linear regression analysis. g, Volcano
plots as in d, but with ER resident proteins coloured in purple. h, i, Individual
ratio for n = 326 AHA-labelled ER membrane resident proteins using the
protocol in Fig. 3g in WT, ATG7-/-, or RB1CC1-/- 293T cells (h) and plots for all
individual ER resident protein data points used in h (i). Centre data are
mean ± s.e.m. Related to Fig. 3. For gel source data, see Supplementary Fig. 1.


[Extended Data Fig. 6 image. Recoverable panel labels: a, workflow for 293T WT, ATG7-/- and RB1CC1-/- cells: AHA labelling (5 h), short-lived protein degradation, long-lived protein degradation (UT or Tor1), followed by (1) WB for mTOR inhibition (quality check), (2) click with TMR alkyne (quality check), (3) click with biotin alkyne (complete reaction check), (4) streptavidin enrichment and (5) TMTpro 16plex MS analysis. d, click reaction yield assay: no further reactivity with TMR-alkyne suggests a near complete reaction yield with the biotin-alkyne. e, Pattern 1 (GABARAPL2, TEX264, SEC62, PCM1); Pattern 2 (HMGCS1, UCK2, MKLN1, TIMM17a, SHQ1, GNL3L); Pattern 3 (SLBP, UTP1, CDCA5, NUCB1, LONP2, HSPE1); time points 0, 5, 10 and 15 h. f, x axis: Log₂ ratio (Tor1/UT), single-shot analyses.]

Extended Data Fig. 6 | Protein degradation time-course experiment during
nutrient stress via AHA-TMT methods. a, Schematic of AHA-based
degradomics time-course to examine r-protein turnover during nutrient stress
with or without functional autophagy. b, Lysates from 293T cells treated as in a
were subjected to SDS-PAGE followed by immunoblotting to confirm proper
mTOR inhibition. c, Cell extracts were reacted with TMR alkyne for click
reaction followed by in-gel fluorescence analysis. d, Click reaction yield across
the replicates was confirmed indirectly. In brief, cell extracts treated as in a
were clicked with biotin alkyne followed by streptavidin capture. The proteins
in the flow-through were precipitated, resuspended in 2% SDS, then clicked with
TMR-alkyne. e, Patterns of protein turnover, as described in the main Fig. 3h, in
the time-course experiment. Mean ± s.e.m. Proteins analysed (n = 4 top; n = 6
middle and bottom) are shown below. f, Volcano plots for the indicated time
(-Log₁₀ p-value versus Log₂ Tor1/UT) in 293T WT (n = 3334 proteins), ATG7-/-
(n = 3351 proteins), and RB1CC1-/- (n = 3375 proteins) cells. r-proteins, red, green
or blue. P values were calculated by two-sided Welch’s t-test (adjusted for
multiple comparisons); for parameters, individual P values and q values, see
Supplementary Table 3. g, A box plot of the individual ratio for AHA-labelled
r-proteins using the protocol in a (centre line, median; box limits correspond to
the first and third quartiles; whiskers, 10-90 percentiles range). n = 48
r-proteins that were quantified across WT, ATG7-/- and RB1CC1-/- 293T cells.
Two-sided t-test, P = 0.1514, 0.2818, 0.0005, 0.000016, 0.000051, 0.0005,
0.000001, 0.000020, 0.0105 from left to right; NS: non-significant, *P < 0.1,
**P < 0.001, ****P < 0.0001. Experiments in b-d were replicated twice
independently and showed similar results. For gel source data, see
Supplementary Fig. 1.
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Extended Data Fig. 7 | See next page for caption. 
Extended Data Fig. 7 | Optimization of nuclear-cytosolic partitioning 
during nutrient stress. a, Contribution of cytoplasmic and nuclear 
partitioning to net ribosome balance upon nutrient stress. b, c, Optimization 
of the nuclear and cytosolic fraction partitioning method using 293T cells. Side 
by side comparison of four previously published methods was performed. 
Surfactant used in method 1: 1% Triton, method 2: 0.1% Triton, method 3: 0.1% 
NP40, method 4: 0.05% NP40. For more details on methods 1-4, see Methods. 
Methods 2 and 3 were further compared in c. 1: ctrl, 2: Leptomycin B (20 nM, 
16 h), 3: MgCl2 added in lysis buffer, 4: 2+3, 5: ctrl, 6: Leptomycin B (20 nM, 16 h), 
7: MgCl2 added in lysis buffer, 8: additional pipetting. d, Effect of the centrifugal 
velocity and duration on nuclear-cytosol partitioning is shown. 1: 13K rpm, 
10 sec, 2: 10K rpm, 10 sec, 3: 7K rpm, 10 sec, 4: 7K rpm, 30 sec, 5: 5K rpm, 60 sec, 6: 
5K rpm, 180 sec, 7: 3K rpm, 60 sec, 8: 3K rpm, 180 sec. e, f, Lysates collected after 
Tor1 treatment for 0, 1 or 3 h were subjected to the optimized nuclear-cytosol 
partitioning, followed by immunoblotting against the indicated antibodies (e). 
Quantification measured by Odyssey is shown in f. g, Scheme depicting strategy 
for quantitative analysis of changes in nuclear and cytosolic protein abundance 
in 293T cells in response to a short period (3 h) of amino acid withdrawal. 
h, i, Biochemical characterization of nuclear and cytosolic 293T cell fractions 
in response to amino acid withdrawal. Extracts (15 μg of cytosol and nuclei) 
were separated by SDS-PAGE and immunoblots probed with the indicated 
antibodies (see Methods). j, k, Volcano plots (−Log10 p-value versus 
Log2 −AA/UT) for nuclear (j) or cytosolic (k) proteins (n = 9193 proteins) 
quantified by TMT-based proteomics. P values were calculated by two-sided 
Welch’s t-test (adjusted for multiple comparisons); for parameters, individual 
P values and q values, see Supplementary Table 4. Nuclear fraction (j) n = 3 
(UT; −AA), cytosolic fraction (k) n = 2 (UT); n = 3 (−AA) biologically independent 
samples. l, Relative RPS6, FOXK1, FOXK2, and CAD phospho-peptide 
abundance quantified by TMT-based proteomics confirming strong inhibition 
of mTOR by Tor1. Centre data are mean ± s.d. n = 2 for UT RPS6 and CAD, and 3 
for the rest. m, n, Relative abundance of proteins that translocate either from 
cytosol to nucleus (m) or from nucleus to cytosol (n) after 3 h amino acid 
withdrawal, including proteins linked with nutrient dependent transcription 
(TFEB, MITF, TFE3 - accumulating in the nucleus), and proteins involved in 
ribosome assembly (PWP1, SDAD1, NVL - exported from the nucleus to the 
cytosol). Centre data are mean ± s.d. n = 3, 2, 3 and 3 biologically independent 
samples, from left to right for each indicated protein. o, p, 293T cells were 
treated with −AA medium for the indicated time period, partitioned into 
nuclear and cytosolic fractions, followed by immunoblotting against the 
indicated antibodies (o). Odyssey quantification shown in p. Experiments 
shown in b-d were performed once, e, h, i were performed more than three 
times independently with similar results, and o was performed twice with 
similar results. For gel source data, see Supplementary Fig. 1. 
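
Several of these legends describe volcano plots of −Log10 p-value against a Log2 ratio, with P values from a two-sided Welch's t-test. As a minimal sketch of the two quantities behind each plotted protein, the Python below computes the Log2 ratio of group means and the Welch t statistic with its Welch-Satterthwaite degrees of freedom. The function names and the toy intensities are hypothetical, the choice to take the ratio of arithmetic means is an assumption, and neither the authors' software pipeline nor their multiple-comparison adjustment is reproduced here. Converting t into a P value additionally requires the t distribution, which is left out.

```python
import math
from statistics import mean, variance


def log2_ratio(treated, untreated):
    """Log2 of the ratio of group means (x-axis of a volcano plot)."""
    return math.log2(mean(treated) / mean(untreated))


def welch_t(treated, untreated):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Unlike Student's t-test, the two groups may have unequal variances
    and unequal sizes (for example n = 2 versus n = 3 replicates).
    """
    n1, n2 = len(treated), len(untreated)
    # Sample variances divided by group size give the squared standard errors.
    s1 = variance(treated) / n1
    s2 = variance(untreated) / n2
    t_stat = (mean(treated) - mean(untreated)) / math.sqrt(s1 + s2)
    dof = (s1 + s2) ** 2 / (s1 ** 2 / (n1 - 1) + s2 ** 2 / (n2 - 1))
    return t_stat, dof


# Hypothetical reporter intensities for one protein (illustration only).
treated = [1.0, 2.0, 3.0]
untreated = [2.0, 4.0, 6.0]

ratio = log2_ratio(treated, untreated)
t_stat, dof = welch_t(treated, untreated)
print(ratio, t_stat, dof)
```

With the toy values above the treated mean is half the untreated mean, so the Log2 ratio is −1.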
Extended Data Fig. 8 | Contribution of nuclear-cytosolic partitioning to 
r-protein abundance during nutrient stress and ribophagy flux 
measurement using the Ribo–Keima system. a, Relative abundance of 
individual r-proteins from the 60S subunit (left) or 40S subunit (right) with the 
cytosol fraction in grey and the nuclear fraction in blue (less than ±10% error 
range for every replicate). Individual r-proteins that are thought to assemble 
onto the ribosome either late in the assembly process or specifically in the 
cytosol are indicated in red font. Mean ± s.d. n = 2 for cytosolic fraction and 3 for 
nuclear fraction as shown in Extended Data Fig. 7g. b, c, Abundance of nuclear 
and cytosolic r-proteins after amino acid withdrawal (3 h). 60S subunits are on 
top (n = 38), and 40S subunits are at the bottom (n = 26). Right panels indicate 
the relative r-protein abundance change normalized by UT of cytosolic or 
nuclear fraction. Mean ± s.e.m. d, −AA/UT ratio of individual r-proteins from 
both nuclear and cytosolic fractions collected after 3 h amino acid starvation 
indicates heterogeneous distribution with RPS7 most strongly downregulated. 
n = 64, mean ± s.e.m. e, Lysates from HEK293 RPS3-Keima cells after the 
indicated nutrient stress were immunoblotted against anti-Keima antibody 
(top). Abundance ratio of the processed Keima to the intact Keima measured by 
Odyssey is plotted (bottom). f, Flow cytometry analysis of HEK293 RPS3-Keima 
cells to obtain normalization factors. To achieve a condition in which cells have 
0% of the ribosomes in the lysosome, RPS3-Keima cells were treated with 
SAR405 for 10 h and collected in pH 7.2 FACS buffer. To achieve a theoretical 
condition in which 100% ribosomes are present in the lysosome, the cells were 
incubated in pH 4.5 FACS buffer containing 0.1% Triton-X. We used the 561/488 
ratio from these two measurements to calculate the % lysosomal ribosomes in g 
and h. n = 1742 cells for each. g, h, HEK293 RPS3-Keima cells were left untreated 
or treated with Tor1 (200 nM) in the presence or absence of SAR405 (1 μM) for 
12 h. 561 nm ex to 488 nm ex Keima signal was measured by flow cytometry and 
plotted as either a frequency histogram (g, n = 1742 cells for each) or a bar graph 
(h, n = 3 biologically independent samples, mean ± s.d.). i, TMT-based 
quantification of endogenous RPS3 abundance in 293T WT, ATG7−/−, and 
RB1CC1−/− cells treated and processed as in Fig. 3g. n = 3 biologically 
independent samples. Mean ± s.d. Experiments in e, g, h were repeated three 
times independently with similar results, and f was repeated once. For gel 
source data, see Supplementary Fig. 1. 
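
The normalization in panel f amounts to a two-point calibration: one 561/488 ratio measured when 0% of ribosomes are lysosomal and one when, theoretically, 100% are. The legend does not give the formula, so the sketch below assumes a linear interpolation between the two reference ratios. The function name, argument names and example ratios are all hypothetical.

```python
def percent_lysosomal(ratio, ratio_0, ratio_100):
    """Map a 561/488 Keima ratio onto a 0-100% scale.

    ratio_0   : reference ratio for the 0% condition (SAR405, pH 7.2 buffer)
    ratio_100 : reference ratio for the 100% condition (pH 4.5 buffer + Triton-X)
    Assumes the signal scales linearly between the two reference points.
    """
    if ratio_100 == ratio_0:
        raise ValueError("reference ratios must differ")
    return 100.0 * (ratio - ratio_0) / (ratio_100 - ratio_0)


# Hypothetical ratios, for illustration only.
example = percent_lysosomal(1.0, ratio_0=0.5, ratio_100=2.5)
print(example)  # 25.0
```

A measured ratio equal to either reference value maps to 0% or 100% by construction. Values outside that range would indicate that the linear assumption or the references do not hold for that sample.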


Extended Data Fig. 9 | NUFIP1 deletion does not affect the ribosome 
inventory changes with nutrient stress. a, Extracts from the 293T cells with 
or without NUFIP1 immunoblotted with the indicated antibodies after 10 h of 
amino acid withdrawal, showing the same level of mTOR inhibition and 
r-protein abundance regardless of the NUFIP1 deletion. Three biologically 
independent samples are blotted except two samples for the NUFIP1−/− −AA 
condition. b, Volcano plot (−Log10 p-value versus Log2 ratio (NUFIP1−/−/WT)) of 
293T cells with or without NUFIP1 deletion (n = 7032 proteins). P values were 
calculated by two-sided Welch’s t-test (adjusted for multiple comparisons); for 
parameters, individual P values and q values, see Supplementary Table 5. n = 3 
biologically independent samples per genotype. c, Normalized TMR signal in 
HCT116 RPS3-Halo NUFIP1+/+ or −/− cells incubated with or without 200 nM Torin 
for 24 h, followed by 1 h TMR ligand treatment and flow cytometry analysis. 
Mean ± s.d. of the triplicate data are plotted. d, Average ratio of pre-existing to 
newly synthesized RPS3-Halo per cell with or without NUFIP1 plotted as a bar 
graph. Pre-existing Ribo–Halo proteins were labelled with TMR ligand (100 nM, 
1 h), followed by the addition of 50 nM Green-ligand. n = 3 biologically 
independent samples. Mean ± s.d. e, Live-cell imaging of indicated Ribo–Halo 
cells with or without NUFIP1 labelled with TMR (for pre-existing r-proteins) and 
Green (for newly synthesized r-proteins) ligands with or without Tor1 (200 nM, 
14 h). Scale bar, 20 μm. f, Schematic description of the triple TMT-MS analysis of 
the whole cell lysates gathered from WT 293T or RPS3-Halo 293T cells with or 
without NUFIP1 after nutrient stress for 10 h. g, Lysates of cells treated as in f 
were immunoblotted against the indicated antibodies for quality control, 
showing that mTOR activity was properly inhibited in all three cell types. 
h, Volcano plots (−Log10 p-value versus Log2 ratio (Nutrient stress/Untreated)) of 
the cells treated as in f (WT n = 2072 proteins; RPS3-Halo n = 2105 proteins; 
RPS3-Halo and NUFIP1−/− n = 2241 proteins). Introducing HaloTag at the 
endogenous locus did not alter the mTOR inhibition or the ribosome abundance 
change after nutrient stress. Deletion of NUFIP1 did not show a detectable 
difference either. P values were calculated by two-sided Welch’s t-test (adjusted 
for multiple comparisons); for parameters, individual P values and q values, see 
Supplementary Table 5. n = 3 (UT); 4 (−AA, Tor1) biologically independent 
samples per cell line. i, Immunoblotting of the cell lysates prepared as in f 
shows that introducing HaloTag at the endogenous locus did not alter the 
mTOR inhibition or the ribosome abundance change after nutrient stress. 
Deletion of NUFIP1 did not show a detectable difference either, consistent with 
the TMT-MS analysis. j, Keima processing assay using the lysates from HEK293 
RPS3-Keima WT or NUFIP1−/− cells after the indicated nutrient stress. Also see 
Supplementary Table 5. Experiments in e were repeated twice with similar 
results, and g, i, j were repeated three times independently with similar results. 
For gel source data, see Supplementary Fig. 1. 
WT HEK293T cells were used in all panels, unless otherwise indicated. 


Extended Data Fig. 10 | Systematic analysis of r-protein homeostasis in 
response to withdrawal of single amino acids. a, 293T cells were treated with 
either Tor1 or 6-MP or were incubated in medium lacking Leu, Arg, or amino 
acids for the indicated times and cell extracts subjected to immunoblotting 
with the indicated antibodies. b, Quantification of the WB data in a and c. 
c, Puromycin incorporation assay after treating 293T cells with the indicated 
medium and time course. Immunoblotting against anti-puromycin antibody 
was probed using either infrared fluorophore labelled secondary antibody 
coupled with Odyssey or horseradish peroxidase labelled secondary antibody 
coupled with ECL for comparison. d, Quantification of the relative abundance 
of pre-existing (red) and newly synthesized (green) RPS3-Halo in Fig. 4b. Mean, 
n = 2. e, Cell extracts treated as indicated were analysed for either mTOR 
inhibition or translatome using AHA incorporation assay coupled with 
TMR-alkyne click. f, Time-course protein synthesis assay was performed using AHA 
clicked with TMR-alkyne method after the indicated nutrient stress. g, h, Cells 
treated as in Fig. 4c were clicked with TMR-alkyne and analysed by in-gel 
fluorescence signal (g). Immunoblot assays using the indicated antibodies for 
quality control are shown in h (top), and relative TMR signal is plotted below. 
Centre data are mean ± s.d. n = 1, 3, 3, 3, 3 and 3 biologically independent 
samples, from left to right. i, Lysates from HEK293 RPS3-Keima cells after the 
indicated nutrient stress were immunoblotted against antibodies for Keima or 
mTOR substrates (top). Abundance ratio of the processed Keima to the intact 
Keima is plotted (bottom). Related to Fig. 4. Experiments in a, c, g, i were 
repeated three times independently with similar results, and e, f, i were 
performed once. For gel source data, see Supplementary Fig. 1. 
[Extended Data Fig. 11, panel text] Predicted 5.6% reduction of ribosome concentration after 12 h Tor1 treatment, comparable to the 5-8% seen by total proteome analysis. 
In autophagy-defective cells, the amount degraded (Tor1, 12 h) = 0.16*A0, as autophagy is responsible for 4% of RPS3 degradation out of 20%. 




Extended Data Fig. 11 | Systematic decoding of r-protein homeostasis in 
response to purine deficiency. a, Schematic diagram indicating the points of 
intersection of nucleotide availability with net ribosome production. 6-MP is 
an inhibitor of purine biosynthesis and blocks production of rRNAs. b, 293T 
cells were treated with or without Tor1, 6-MP, or −Arg medium for the indicated 
time duration. AHA (250 μM) was added 3 h before collecting the cells. Lysates 
were either immunoblotted against the indicated antibodies for mTOR 
signalling inhibition (bottom) or processed for TMR-click reaction for in-gel 
fluorescence analysis (top). c, 293T cells treated with 6-MP for the indicated 
time points as well as AHA (for the last three hours) were analysed for either 
mTOR inhibition (middle) or translatome using AHA incorporation assay 
coupled with TMR-alkyne click (top). Relative translation efficiency is plotted 
(bottom). Centre data are mean ± s.d. n = 1, 3, 3, 3 and 3 biologically 
independent samples, from left to right. d, 293T cells were incubated in the 
presence or absence of 6-MP for the indicated times, and AHA (250 μM) was 
added 3 h before collecting each lysate. The translatome was analysed by 
biotinylation of AHA-labelled proteins followed by TMT-based proteomics. 
A volcano plot (−Log10 p-value versus Log2 6-MP/UT) showing the translatome 
and individual r-proteins (in red) at three time points. P values were calculated 
by two-sided Welch’s t-test (adjusted for multiple comparisons); for 
parameters, individual P values and q values, see Supplementary Table 6. n = 1 
(Neg Ctrl); n = 3 (UT, 9 h); 2 (6 h, 18 h) biologically independent samples per cell 
line. e, Pre-existing RPL29 in HCT116 RPL29-Halo cells was labelled with TMR 
(red), washed and then incubated with medium containing green Halo ligand 
with or without 6-MP (24 h). Live cells were imaged. f, HCT116 RPS3-Halo cells 
were subjected to 2-colour labelling as in Fig. 2f. Cells were either left untreated 
or incubated with 6-MP before analysis of pre-existing and newly synthesized 
RPS3-Halo using in-gel fluorescence signal-based quantification. Histograms 
show the relative abundance of pre-existing (red) and newly synthesized 
(green) RPS3-Halo. Centre data are mean ± s.d. n = 2, 3, 3, 3, 3, 3 and 3 
biologically independent samples, from left to right for both histograms. g, 
Total proteome analysis of 293T cells (with or without ATG5) was performed 
according to the scheme (top). Volcano plots (−Log10 p-value versus Log2 6-MP/UT) 
for all quantified proteins (n = 8234 proteins), including individual 
r-proteins marked with a red dot, are shown at the bottom. P values were 
calculated by two-sided Welch’s t-test (adjusted for multiple comparisons); for 
parameters, individual P values and q values, see Supplementary Table 6. n = 3 
(UT); 2 (6-MP) biologically independent samples per cell line. h, Keima 
processing assay using the lysates from HEK293 RPS3-Keima WT or ATG5−/− cells 
after 6-MP treatment (24 h). Cells treated with arsenite were also blotted as a 
positive control, as it was previously reported to induce selective ribophagy. 
i, j, Analysis of ribosome concentration using biosynthetic, degradative and 
cell division information is shown as a simple equation in i. j, Change of 
ribosome concentration in response to nutrient stress using 0.2 as the 
degradation rate (derived from AHA-degradomics measurements), translation 
rates (T) of 0.35 derived from AHA-translatome analysis, and a cell cycle factor 
of 1.5 (Y = 1 + t/24, t = 12 h) derived from the proliferation assay. We find that the 
ribosome concentration upon 12 h of Tor1 treatment [0.944*A0/V0] is 
comparable to the reduction in ribosomes we measured by total proteome 
analysis (reduction of ribosomes of ~5-8%). Summary of the systematic 
quantitative analysis of ribosome inventory during nutrient stress. Related to 
Fig. 4. Experiments in b were performed once, and e, h were repeated three 
times independently with similar results. For gel source data, see 
Supplementary Fig. 1. 
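
The numbers quoted for panel j can be checked with simple arithmetic. The sketch below reproduces only figures stated in the legend and panel text: the cell cycle factor Y = 1 + t/24, the relative concentration of 0.944, and the 4% autophagy share of a 20% degradation. It does not reconstruct the full equation of panel i, which is not given in the text. All function and variable names are illustrative assumptions.

```python
def cell_cycle_factor(t_hours, divisor_hours=24.0):
    """Y = 1 + t/24, as quoted in the legend (t in hours)."""
    return 1.0 + t_hours / divisor_hours


def percent_reduction(relative_concentration):
    """Percentage drop implied by a concentration relative to the start."""
    return (1.0 - relative_concentration) * 100.0


# Values taken from the legend and panel text.
Y = cell_cycle_factor(12)             # cell cycle factor at t = 12 h
reduction = percent_reduction(0.944)  # from 0.944*A0/V0

# Degradation left once the autophagy share (4% of the 20%) is removed.
degradation_without_autophagy = 0.20 - 0.04

print(Y, round(reduction, 1), round(degradation_without_autophagy, 2))
```

The three results correspond to the cell cycle factor of 1.5, the predicted 5.6% reduction, and the 0.16 factor quoted for autophagy-defective cells.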
Statistics 


For all statistical analyses, confirm that the following items are present in the figure legend, table legend, main text, or Methods section. 
n/a | Confirmed 


The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement 


A statement on whether measurements were taken from distinct samples or whether the same sample was measured repeatedly 


The statistical test(s) used AND whether they are one- or two-sided 
Only common tests should be described solely by name; describe more complex techniques in the Methods section. 


A description of all covariates tested 


A description of any assumptions or corrections, such as tests of normality and adjustment for multiple comparisons 


A full description of the statistical parameters including central tendency (e.g. means) or other basic estimates (e.g. regression coefficient) 
AND variation (e.g. standard deviation) or associated estimates of uncertainty (e.g. confidence intervals) 


For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted 
Give P values as exact values whenever suitable. 


For Bayesian analysis, information on the choice of priors and Markov chain Monte Carlo settings 


For hierarchical and complex designs, identification of the appropriate level for tests and full reporting of outcomes 


Estimates of effect sizes (e.g. Cohen's d, Pearson's r), indicating how they were calculated 


Our web collection on statistics for biologists contains articles on many of the points above. 


Software and code 


Policy information about availability of computer code 


Data collection 1. Orbitrap Eclipse Tribrid Mass Spectrometer (Cat#FSN04-10000) with FAIMS Pro Interface (#FMS02-10001) - Thermo Fisher Scientific 
2. Orbitrap Fusion Lumos Tribrid MS (Cat#IQLAAEGAAPFADBMBHQ) with or without FAIMS Pro Interface (#FMS02-10001) - Thermo Fisher Scientific 
3. Orbitrap Fusion Tribrid MS (Cat#IQLAAEGAAPFADBMBCX) with FAIMS Pro Interface (#FMS02-10001) - Thermo Fisher Scientific 
4. On-line real-time searching was done via Orbiter version 1 (DOI: 10.1021/acs.jproteome.9b00860), a function newly implemented in the latest released Orbitrap Eclipse Tribrid Mass Spectrometer factory software (Thermo Fisher Scientific) 
5. Odyssey CLx Imager; LI-COR Biosciences 
6. MoxiGo II; ORFLOW Technologies Inc 
7. ChemiDocMP (Bio-rad); https://www.bio-rad.com 


Data analysis 1. Prism; GraphPad, v7&8 https://www.graphpad.com/scientific-software/prism/ 
2. Proteome Discoverer; Thermo Fisher Scientific, v2.3 https://www.thermofisher.com/order/catalog/product/OPTON-30795 
3. SEQUEST; Eng et al., (1994) J Am Soc Mass Spectrom. 5 (11): 976-989. 
4. Comet (v2018.01 rev. 2); Eng, J.K. et al. (2013), Proteomics 13, 22-24. 
5. Perseus; Tyanova et al., Nat Methods. (2016) 13:731-40. http://www.perseus-framework.org 
6. FlowJo; V10.5.2 https://www.flowjo.com 
7. ImageStudioLite V 5.2.5 https://www.licor.com/bio/products/software/image_studio_lite 
8. Image Lab Software V6.0.1 https://www.bio-rad.com 
9. Fiji ImageJ V.2.0.0 https://imagej.net/Fiji 
10. Microsoft Excel v16 https://www.microsoft.com 
11. MetaMorph v7.10.1 https://www.moleculardevices.com 
For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors/reviewers. 
We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Research guidelines for submitting code & software for further information. 


Data 


Policy information about availability of data 
All manuscripts must include a data availability statement. This statement should provide the following information, where applicable: 


- Accession codes, unique identifiers, or web links for publicly available datasets 
- A list of figures that have associated raw data 
- A description of any restrictions on data availability 


Data Availability 

All raw mass spectrometry data have been deposited in PRIDE ProteomeXchange (http://www.proteomexchange.org/): Extended Data Set 1 (PXD017852, 
PXD017853); Extended Data Set 2 (PXD018252); Extended Data Set 3 (PXD017857, PXD018158); Extended Data Set 4 (PXD017855, PXD017856); Extended Data Set 
5 (PXD017858, PXD017851); Extended Data Set 6 (PXD017859, PXD017860, PXD017861). Source data for all proteomics-based plots are provided in Supplementary 
Tables 1-6. All other reagents are available from the corresponding author upon request. 


Field-specific reporting 


Please select the one below that is the best fit for your research. If you are not sure, read the appropriate sections before making your selection. 


Life sciences Behavioural & social sciences Ecological, evolutionary & environmental sciences 


For a reference copy of the document with all sections, see nature.com/documents/nr-reporting-summary-flat.pdf 


Life sciences study design 


All studies must disclose on these points even when the disclosure is negative. 


Sample size No sample-size calculation was performed. Sample size was determined to be adequate based on the magnitude and consistency of 
measurable differences between groups and low observed variability between samples. For proteomics experiments, we chose n=3 or n=4 
given the limitation of the available TMT channels. For flow cytometry experiments, we analyzed >3000 cells with triplicate experiments 
which showed consistent results throughout the replicates. In-gel fluorescence analyses were also performed at least three times. 


Data exclusions No data were excluded from the analyses. 


Replication All experiments were replicated and all attempts at replication were successful and consistent. For proteomics experiments replicates 
clustered together in PCA, and we observed low coefficient of variation among replicates. 


Randomization Proteomics samples for comparison with TMT and TMTpro reagents were randomly allocated in the TMT/TMTpro group and replicates were 
in adjacent channels. For other experiments, no randomization was done. 


Blinding Blinding was not relevant in this study, because all the data were analyzed using unbiased methods. Furthermore, the whole image fields 
were shown in the figures without subjective cropping. 


Reporting for specific materials, systems and methods 


We require information from authors about some types of materials, experimental systems and methods used in many studies. Here, indicate whether each material, 
system or method listed is relevant to your study. If you are not sure if a list item applies to your research, read the appropriate section before selecting a response. 


Materials & experimental systems (n/a | Involved in the study) 
Antibodies 
Eukaryotic cell lines 
Palaeontology 
Animals and other organisms 
Human research participants 
Clinical data 

Methods (n/a | Involved in the study) 
ChIP-seq 
Flow cytometry 
MRI-based neuroimaging 


Antibodies 


Antibodies used 1. RPS3 Cell Signaling Technology; Cat. No. 9538; RRID:AB_10622028; Dilution 1:2000 
2. RPL28 Abcam; Cat. No. ab138125; RRID:AB_10622028; Dilution 1:2000 
3. RPS15a Bethyl Lab; Cat. No. A304-990A-T; RRID:AB_2782308; Dilution 1:1000 
4. ATG7 Cell Signaling Technology; Cat. No. 8558S; RRID:AB_10831194; Dilution 1:1000 
5. Keima MBL international; Cat. No. M182-3; RRID:AB_10794910; Dilution 1:2000 
6. LC3B MBL international; Cat. No. M186-3; RRID:AB_10897859; Dilution 1:1000 
7. ATG5 Cell Signaling Technology; Cat. No. 12994; RRID:AB_2630393; Dilution 1:1000 
8. p70 S6K phospho-T389 Cell Signaling Technology; Cat. No. 9234S; RRID:AB_2269803; Dilution 1:1000 
9. phospho-S6 ribosomal protein Ser235/236 Cell Signaling Technology; Cat. No. 4858; Dilution 1:3000 
10. TEX264 Sigma; Cat. No. HPA017739; RRID:AB_1857910; Dilution 1:1000 
11. RPL23 Bethyl Lab; Cat. No. A305-010A-T; RRID:AB_2782326; Dilution 1:1000 
12. RPL7 Bethyl Lab; Cat. No. A300-741A-T; RRID:AB_2779432; Dilution 1:1000 
13. RPL29 Proteintech Group; Cat. No. 15799-1-AP; Dilution 1:2000 
14. Tubulin Abcam; Cat. No. ab7291; RRID:AB_2241126; Dilution 1:3000 
15. SQSTM1 Novus Biologicals; Cat. No. H00008878-M01; RRID:AB_548364; Dilution 1:2000 
16. Anti-puromycin antibody EMD millipore; Cat. No. MABE343; RRID:AB_2566826; Dilution 1:3000 
17. NUFIP1 Proteintech Group; Cat. No. 12515-1-AP; RRID:AB_2298759; Dilution 1:1000 
18. RPS6 Cell Signaling Technology; Cat. No. 2217; RRID:AB_331355; Dilution 1:3000 
19. 4EBP1 Cell Signaling Technology; Cat. No. 9644; RRID:AB_2097841; Dilution 1:3000 
20. Lamin A/C Cell Signaling Technology; Cat. No. 4777; RRID:AB_10545756; Dilution 1:1000 
21. TFEB Cell Signaling Technology; Cat. No. 4042; Dilution 1:1000 
22. YBX1 Proteintech Group; Cat. No. 20339-1-AP; RRID:AB_10665424; Dilution 1:1000 
23. Sestrin 2 (SESN2) Proteintech Group; Cat. No. 10795-1-AP; RRID:AB_2185480; Dilution 1:1000 
24. SDAD1 Bethyl Lab; Cat. No. A304-692A-T; RRID:AB_2620887; Dilution 1:1000 
25. NVL Proteintech Group; Cat. No. 16970-1-AP; RRID:AB_2157811; Dilution 1:1000 
26. c-Maf R&D Systems; Cat. No. MAB8227-SP; Dilution 1:1000 
27. IRDye® 800CW Streptavidin LI-COR; Cat. No. 926-32230; Dilution 1:15000 
28. IRDye 800CW Goat anti-Rabbit IgG H+L LI-COR; Cat. No. 925-32211; RRID:AB_2651127; Dilution 1:15000 
29. IRDye 680 RD Goat anti-Mouse IgG H+L LI-COR; Cat. No. 926-68070; RRID:AB_10956588; Dilution 1:15000 
Validation Specificity (human) for NUFIP1, ATG5, ATG7, RB1CC1 and TEX264 antibodies was determined by testing CRISPR deleted cells as 
controls. For RPS3 and RPL29, specificity for immunoblotting (human) was determined by tagging of the respective endogenous 
gene. The overall immunoblotting results agreed with next-generation sequencing results of NUFIP1, ATG5, TEX264, RPS3, RPL29 
engineered cells and quantitative proteomics results of NUFIP1 and TEX264 deleted cells. For other antibodies, we employed the 
manufacturer's validation provided in individual datasheets. 


Eukaryotic cell lines 


Policy information about cell lines 


Cell line source(s) Human: HEK293 ATCC CRL-1573; RRID:CVCL_0045 
Human: HCT116 ATCC CCL-247; RRID:CVCL_0291 
Human: HEK293T ATCC CRL-3216; RRID:CVCL_0063 


Authentication Karyotyping (GTG-banded karyotype) of HCT116, 293, and 293T cells (from ATCC) was performed by Brigham and Women's Hospital Cytogenomics Core Laboratory. 

Mycoplasma contamination All cell lines were found to be free of mycoplasma using Mycoplasma Plus PCR assay kit (Agilent). 

Commonly misidentified lines (See ICLAC register) none 


Flow Cytometry 


Plots 


Confirm that: 


The axis labels state the marker and fluorochrome used (e.g. CD4-FITC). 


The axis scales are clearly visible. Include numbers along axes only for bottom left plot of group (a 'group' is an analysis of identical markers). 


All plots are contour plots with outliers or pseudocolor plots. 


A numerical value for number of cells or percentage (with statistics) is provided. 


Methodology 

Sample preparation No tissue processing was used. Only cultured HCT116 cells gene edited with Halo tags were used for flow cytometry. Cells were 
labeled with either TMR (red) ligand or R110 (green) ligand via the Halo enzyme. 

Instrument LSR-II Analyser, BD Biosciences 

Software FlowJo; V10.5.2 https://www.flowjo.com 

Cell population abundance Typically >3,000 cells were analyzed, as indicated. 

Gating strategy 1. Singlet cells were gated by SSC1 height/FSC1 height (G1) followed by SSC1 height/SSC1-width (G2). 2. R110 and TMR signals were 
measured by 488ex/513(26)em and 561ex/579(16)em and data were exported to Prism for ratiometric calculation. 


Tick this box to confirm that a figure exemplifying the gating strategy is provided in the Supplementary Information. 


Zhenwei Zhang, Cindy L. Will, Karl Bertram, Olexandr Dybkov, Klaus Hartmuth, 
Dmitry E. Agafonov, Romina Hofele, Henning Urlaub, Berthold Kastner, 
Reinhard Lührmann & Holger Stark 


The U2 small nuclear ribonucleoprotein (snRNP) has an essential role in the selection 
of the precursor mRNA branch-site adenosine, the nucleophile for the first step of 
splicing. Stable addition of U2 during early spliceosome formation requires the 
DEAD-box ATPase PRP5. Yeast U2 small nuclear RNA (snRNA) nucleotides that form 
base pairs with the branch site are initially sequestered in a branchpoint-interacting 
stem-loop (BSL), but whether the human U2 snRNA folds in a similar manner is 
unknown. The U2 SF3B1 protein, a common mutational target in haematopoietic 
cancers, contains a HEAT domain (SF3B1^HEAT) with an open conformation in isolated 
SF3b, but a closed conformation in spliceosomes, which is required for stable 
interaction between U2 and the branch site. Here we report a 3D cryo-electron 
microscopy structure of the human 17S U2 snRNP at a core resolution of 4.1 Å and 
combine it with protein crosslinking data to determine the molecular architecture of 
this snRNP. Our structure reveals that SF3B1^HEAT interacts with PRP5 and TAT-SF1, and 
maintains its open conformation in U2 snRNP, and that U2 snRNA forms a BSL that is 
sandwiched between PRP5, TAT-SF1 and SF3B1^HEAT. Thus, substantial remodelling of 
the BSL and displacement of BSL-interacting proteins must occur to allow formation 
of the U2-branch-site helix. Our studies provide a structural explanation of why 
TAT-SF1 must be displaced before the stable addition of U2 to the spliceosome, and 
identify RNP rearrangements facilitated by PRP5 that are required for stable 
interaction between U2 and the branch site. 


To our knowledge, at present no high-resolution, cryo-electron 
microscopy (cryo-EM) structure of the human 17S U2 snRNP, the 
major subunit of the spliceosomal E and A complexes (Extended Data 
Fig. 1), is available. We thus determined its 3D structure by cryo-EM 
(Extended Data Table 1, Extended Data Fig. 2). Consistent with previous 
low-resolution electron microscopy structures of isolated U2 
snRNPs, and its overall structure in human spliceosomes, human 
17S U2 exhibits a bipartite structure, with a 3’ and 5’ domain bridged 
by several density elements (Fig. 1a). The structure of a major part 
of the 5’ domain could be determined at an overall resolution of 
4.1 Å (Extended Data Fig. 2). However, the 3’ domain and connecting 
bridges, and parts of the 5’ domain are more dynamic and thus less 
well-resolved (Fig. 1a, Extended Data Fig. 2). Because of the limited 
resolution in the more dynamic regions of 17S U2, we used an integrated 
structural biology approach, fitting known X-ray structures or 
homology models of structured regions of U2 components into the 
electron-microscopy density map (Extended Data Table 2), in combination 
with protein crosslinking coupled with mass spectrometry 
(Extended Data Fig. 3, Supplementary Table 1) and other biochemical 
data, to generate a pseudo-atomic model for the well-defined regions 
of the particle (Fig. 1b). 


SF3B1 has an open conformation in 17S U2 


The 17S U2 5’ domain consists predominantly of the SF3b complex. 
SF3B3, PHF5A, SF3B5, and SF3B1""™", which comprise the SF3b core, 
are located in the most well-defined region of the particle (Fig. 1). The 
super-helical structure of SF3B1"*“", which is composed of 20 tan- 
dem HEAT repeats (HRs), exhibits an open conformation in isolated 
SF3b”° (Extended Data Fig. 1d). In B and B** spliceosomal complexes, 
in which U2 stably binds the branch site (BS), SF3B1'""" adopts aclosed 
conformation, encompassing the U2-BS helix and binding the BS 
adenosine (BS-A) ina pocket formed by PHFSA and the HR1S-HRI17 of 
SF3B1 (Extended Data Fig. 1d). In 17S U2 and in the isolated SF3b core 
crystal, SF3B1""" has a very similar open conformation, except that 
HR16 is completely structured in 17S U2 (Extended Data Fig. 4a-c). 
The structural organization of SF3B5, PHF5A and SF3B3 is also highly 
similar (Extended Data Fig. 4d). Crosslinking suggests that SF3B6 (also 
known as P14 or SF3b14a) is also located in a similar region in 17S U2 
and isolated SF3b (Extended Data Fig. 4e). Thus, the structure of the 
SF3b core does not change substantially during 17S U2 assembly and, 
furthermore, the crystal structure of SF3b represents a functionally 
relevant conformation. SF3B1 is the target of several precursor mRNA 
Fig. 1 | 3D cryo-EM model of human 17S U2 snRNP. a, b, Different views of the 
human 17S U2 snRNP cryo-EM density map (a) and molecular architecture (b). 
Dark grey denotes better resolved densities. 


(pre-mRNA) splicing modulators’. The open conformation of SF3B1^HEAT 
indicates that splicing modulators such as pladienolide B can also interact 
with it in the 17S U2 snRNP (Extended Data Fig. 4f), and furthermore 
supports the idea that the SF3B1 conformational change first occurs 
during stable U2 addition to the spliceosome”. The U2 5’ domain is connected 
by three main bridges to the 3’ domain (Extended Data Fig. 5), 
which consists of the U2 Sm core RNP, and U2 snRNA stem-loops (SL) 
III and IV bound by U2-A' and U2-B" (Fig. 1, Extended Data Fig. 5). Similar 
to the B and B^act complexes, SF3a proteins, which comprise part of the 
3’ domain, have a key role in bridging the U2 Sm core with SF3b in 17S 
U251° (Extended Data Fig. 5). 


A BSL forms and is bound by protein 

In the spliceosome, SF3b contacts SLIIa of U2 snRNA™©'8. On the basis 
of its length and conformation (Fig. 2, Extended Data Fig. 6a, b), we 
could unambiguously place U2 SLIIa, as opposed to SLIIb, into one 
of the two well-resolved helical density elements located close to the 
C-terminal HEAT repeats of SF3B1 (Extended Data Fig. 6c, d). The high 
conservation between yeast and humans of U2 nucleotides that form 
the BSL strongly suggests that the latter also forms in the human U2 
snRNP*. The location of the helical density that flanks SLIIa indicates 
that it contains nucleotides 5’ of SLIIa and, as it is a direct continuation 
of SLIIa, that it accommodates the BSL, as opposed to the mutually 
Fig. 2 | A BSL forms in human 17S U2 snRNP. a, 2D structure of human U2 
snRNA. SLI nucleotides are in teal; SLIIa nucleotides are in light green. BSL 
nucleotides that later form the U2-BS helix are in orange, and those of the 
extended region of the U2-BS helix are in yellow. Remaining BSL nucleotides are in dark 
exclusive, extended SLI that is separated from SLIIa by 20 nucleotides 
(Extended Data Fig. 6a, b). Indeed, a modelled BSL fits well in this den- 
sity (Extended Data Fig. 6e). Moreover, nucleotides 42 to 45 on one BSL 
strand, which form part of the extended U2-BS helix in the spliceosome, 
have nearly the same position relative to SLIIa and SF3B1, both in 17S 
U2 and the spliceosome after U2-BS helix formation (Fig. 2). The BSL 
is sandwiched between the SF3B1 C-terminal HEAT repeats and other 
proteins (Fig. 3), and its loop is also sequestered by protein and thus 
inaccessible for base pairing interactions. Indeed, chaperoning away 
reactive regions of spliceosomal RNAs and of the pre-mRNA is a common 
mechanism used by the spliceosome””. The base of the BSL stem 
is contacted by a short helix of SF3A3 (designated the separator helix), 
which ensures that the BSL is only eight base pairs long (Extended 
Data Fig. 6a). This SF3A3 helix also interacts with SLIIa and its position 
relative to the latter is maintained in the spliceosome until its catalytic 
activation’*’*'8 (Extended Data Fig. 6g, h). 


PRP5 is near the BSL and encircles SF3B1 


Human 17S U2 snRNPs contain TAT-SF1, the function of which in splicing 
remains unclear, and the DEAD-box ATPase PRP5°”°. Prp5 is required 
for stable U2 addition during formation of the spliceosomal A com- 
plex and facilitates a conformational change in U2 snRNP7’. It is also 
implicated in the displacement of TAT-SF1 (Cus2 in Saccharomyces 
cerevisiae)”, a prerequisite for U2 incorporation into the spliceosome*”. 
However, the molecular mechanisms by which PRP5 promotes stable 
U2-BS interaction, and the precise nature of RNP rearrangements that it 
facilitates, remain poorly understood. In the 17S U2 structure, the PRP5 
RecA domains can be fitted into cryo-EM density close to the BSL, U2 
snRNA SLIIa and SLIIb, and adjacent to TAT-SF1 (Fig. 3, Extended Data 
Fig. 7a-c). Consistent with biochemical studies in yeast”’, an extended 
N-terminal α-helix of PRP5 interacts with SF3B1 HR9-HR12 (Fig. 3a, 
Extended Data Fig. 7d). Protein crosslinks suggest that PRP5 residues 
both N- and C-terminal of this α-helix also interact with HR12-HR15 
and HR1-HR6, respectively (Fig. 3, Extended Data Fig. 7e). Thus, the 
N-terminal region of PRP5 seems to encompass SF3B1^HEAT, which suggests 
that it aids in stabilizing its open conformation. The most common 
(hotspot) cancer-related point mutations in SF3B1 mainly cluster in 
or near HR6”. Notably, our crosslinking data suggest that the highly 
conserved PRP5 DPLD motif, which is essential for stable PRP5 interaction 
with U2 snRNP™, is located in the proximity of HR6 (Extended Data 
Fig. 7e), consistent with studies in yeast showing that cancer-related 
SF3B1 mutations directly destabilize binding of PRP5”*”>. 


TAT-SF1 contacts the BSL and SF3B1 


RRM1 of TAT-SF1 is located adjacent to SF3B1 HR15-HR16 (Fig. 3, 
Extended Data Fig. 7f, g), which are thought to act as a hinge”. Thus, 
TAT-SF1 could potentially inhibit their hinge-like function and prevent 
the closure of SF3B1^HEAT needed to form the BS-A binding pocket. 




green. Ψ denotes pseudouridine. Methylation of U2 bases is not shown (see 
Extended Data Fig. 6). b, 3D structure of U2 SLI, BSL and SLIIa in human 17S U2, 
and yeast B (PDB code 5NRL) and human B^act (PDB code 6FF4) complexes. 
Fig. 3 | PRP5 and TAT-SF1 are located near the BSL and interact with 
SF3B1^HEAT. a, Location of the PRP5 RecA1 and RecA2 domains, and interaction 
of the PRP5 α-helix with SF3B1^HEAT, and TAT-SF1 RRM1 and UHM with SF3B1^HEAT 
and the BSL. Dotted line denotes the potential path of unstructured PRP5 
regions, based on crosslinking data. b, Fit of TAT-SF1^RRM1 into cryo-EM density 
adjacent to SF3B1 HR15-HR16. A density element that could not be 
unambiguously defined, but is probably part of TAT-SF1, contacts the BSL loop. 
c, The BSL is sequestered by PRP5, TAT-SF1, SF3B1 and the unassigned protein 
region (UPR). Grey denotes cryo-EM density of the unassigned protein region. 
Coloured surfaces are derived from fitted protein models. 


TAT-SF1^RRM1 is closely associated with a density element that interacts 
directly with the BSL loop and is contacted by PRP5^RecA (Fig. 3). This 
density element is probably part of TAT-SF1, as electron-microscopy 
density can be traced in a continuous manner from it to density comprising 
TAT-SF1^RRM1 (Fig. 3). However, we cannot unambiguously identify 
its sequence. Regardless of its nature, this protein module helps to 
stabilize the position of the BSL loop close to SF3B1^HEAT. TAT-SF1 contains 
Fig. 4 | Model of PRP5-mediated remodelling events leading to stable 
U2-BS interaction. a-e, Model of U2 snRNP rearrangements during the 
transition from the spliceosomal E to A complex. U1 snRNP and U2AF are 
omitted for simplicity. a, b, ATP hydrolysis by PRP5 leads to the removal or 
displacement of TAT-SF1, the PRP5 RecA domains, and the unassigned protein 
region (grey) from the BSL, and BSL contact with SF3B1^HEAT is disrupted. This 
a U2AF homology motif (UHM) (Extended Data Fig. 7a) that has affinity 
for the five U2AF ligand motifs (ULMs) present in the SF3B1 N-terminal 
domain, but preferentially binds to ULM5, located closest to the HEAT 
domain”®. Consistent with protein-protein crosslinks, TAT-SF1^UHM can 
be fitted into a less well-resolved density element near the BSL and PRP5 
RecA domains (Fig. 3, Extended Data Fig. 7c, h). Together, the location 
of PRP5 and TAT-SF1 suggests that they aid in stabilizing both the BSL 
and SF3B1 open conformation, and thus must be displaced to allow a 
stable U2-BS interaction. 


Model of PRP5 remodelling of U2 snRNP 


A comparison of our 17S U2 snRNP structure and the structure of U2 
in the spliceosome indicates that a major RNP rearrangement must 
occur to free those U2 nucleotides that subsequently form the U2-BS 
helix and allow stable U2 addition during formation of the A complex 
(Extended Data Fig. 8a). U2 snRNP first associates with the spliceosomal 
E complex via protein-protein contacts, including those between PRP5 
and the U1 snRNP’, and potentially between U2 proteins and SF1 (also 
known as mBBP) bound to the BS, that would bring U2 into the vicinity 
of the BS” (Fig. 4). If accessible, BSL loop nucleotides could in principle 
form base pairs with the BS without disruption of the BSL®. However, in 
human 17S U2, the BSL is sandwiched between the PRP5 RecA domains 
and TAT-SF1^UHM on one side, and HR16-HR19 of SF3B1 and TAT-SF1^RRM1 
on the other, with the unassigned protein region directly contacting 
the BSL loop. PRP5 is a DEAD-box ATPase and thus is expected to bind 
double-stranded RNA*”®. However, the nature of the RNA bound by 
PRP5 is not clear from our U2 structure and PRP5 may first bind RNA 
after 17S U2 interacts with the E complex. Several DEAD-box proteins 
disrupt protein-RNA interactions, in some cases apparently even 
without unwinding dsRNA”. Our data are consistent with a model in 
which, after U2 incorporation into the E complex, ATP binding would 
lead to engagement of PRP5 with the BSL or adjacent RNA nucleotides, 
with hydrolysis leading to displacement or release of BSL-interacting 
proteins, including TAT-SF1 and the RecA domains of PRP5 itself, without 
disrupting the BSL (Fig. 4a, b). This would allow initial U2-BS base 
pairing via the BSL loop, which would also require displacement of 
SF1 from the pre-mRNA BS nucleotides to free them for base pairing 
interactions”. In yeast, Prp5 ATPase activity is not required if CUS2 
(the yeast TAT-SF1 homologue) is no longer present’. We thus propose 
that the BSL, which is no longer stabilized by protein, will subsequently 
become flexible, and thus that its unwinding and the concomitant 
formation of the extended U2-BS helix may not require exogenous 
energy (Fig. 4b-d). However, at present we cannot exclude that PRP5 
destabilizes the BSL in an ATP-dependent manner, and that this leads 
to the concomitant displacement of TAT-SF1 and other BSL-interacting 


frees the BSL loop for initial base pairing with the BS and leads to 
destabilization of its stem. b, c, Movement of the U2 5’ end downward aids BSL 
unwinding. c, d, Rotation of the SLI-containing U2 5’ end generates helical turns 
required to form the extended U2-BS helix. d, e, A conformational change in 
SF3B1 clamps the newly formed U2-BS helix within its HEAT domain and leads 
to PRP5 release. NT-helix, N-terminal helix; PPT, polypyrimidine tract. 


proteins. For topological reasons, formation of the extended U2-BS 
helix would require that the 5’ strand of the BSL no longer contacts 
the SF3B1 C-terminal HEAT repeats. This would be a prerequisite to 
free the SLI-containing 5’ end of U2 snRNA for its proposed downward 
movement (Fig. 4b-d, Extended Data Fig. 8b, c). Rotational movement 
of the 5’ end of the U2 snRNA would then facilitate formation of the 
extended U2-BS helix (Fig. 4b-d, Extended Data Fig. 8c). The latter 
would then move back towards the C-terminal SF3B1 HEAT repeats 
such that U2 nucleotides 42-45 occupy the same position as they did 
before U2-BS formation, and the base of the bulged BS-A would be 
positioned at the hinge region between HR15 and HR17 and close to 
PHF5A. We propose that docking of the BS-A to the hinge region may 
trigger several coordinated movements of the HEAT domain leading 
to its closed conformation and clamping of its terminal HRs onto the 
extended part of the U2-BS helix (Fig. 4e). 

In yeast, Prp5 also functions in an ATP-independent manner during 
formation of the A complex and remains bound to the latter if the BS is 
mutated>*°. In 17S U2, the N-terminal region of PRP5 makes numerous 
contacts with SF3B1^HEAT and therefore potentially stabilizes its open 
conformation. Thus, there may be competition between the maintenance 
of an open conformation, facilitated by the PRP5 N-terminal region, and 
the switch to a closed conformation with the BS-A tightly clamped. In 
this model, only if an appropriate U2-BS helix is formed and the BS-A 
is properly docked will the PRP5-SF3B1^HEAT interaction be dissolved, 
leading to PRP5 release. In this way, PRP5 could potentially proofread 
stable accommodation of the BS-A during A complex formation. Given 
the weak conservation of the mammalian BS, further studies are needed 
to better understand how PRP5 can potentially distinguish between 
ideal U2-BS duplexes and those containing several mismatches. 


Online content 


Any methods, additional references, Nature Research reporting sum- 
maries, source data, extended data, supplementary information, 
acknowledgements, peer review information; details of author con- 
tributions and competing interests; and statements of data and code 
availability are available at https://doi.org/10.1038/s41586-020-2344-3. 


1. Will, C. L. & Lührmann, R. Spliceosome structure and function. Cold Spring Harb. 
Perspect. Biol. 3, a003707 (2011). 

2. Ruby, S. W., Chang, T. H. & Abelson, J. Four yeast spliceosomal proteins (PRP5, PRP9, 
PRP11, and PRP21) interact to promote U2 snRNP binding to pre-mRNA. Genes Dev. 7, 
1909-1925 (1993). 

3. O'Day, C. L., Dalbadie-McFarland, G. & Abelson, J. The Saccharomyces cerevisiae Prp5 
protein has RNA-dependent ATPase activity with specificity for U2 small nuclear RNA. 
J. Biol. Chem. 271, 33261-33267 (1996). 

4. Abu Dayyeh, B. K., Quan, T. K., Castro, M. & Ruby, S. W. Probing interactions between the 
U2 small nuclear ribonucleoprotein and the DEAD-box protein, Prp5. J. Biol. Chem. 277, 
20221-20233 (2002). 


5. Perriman, R., Barta, I., Voeltz, G. K., Abelson, J. & Ares, M., Jr. ATP requirement for Prp5p 
function is determined by Cus2p and the structure of U2 small nuclear RNA. Proc. Natl 
Acad. Sci. USA 100, 13857-13862 (2003). 

6. Will, C. L. et al. Characterization of novel SF3b and 17S U2 snRNP proteins, including a 
human Prp5p homologue and an SF3b DEAD-box protein. EMBO J. 21, 4978-4988 (2002). 

7. Xu, Y. Z. et al. Prp5 bridges U1 and U2 snRNPs and enables stable U2 snRNP association 
with intron RNA. EMBO J. 23, 376-385 (2004). 

8. Perriman, R. & Ares, M., Jr. Invariant U2 snRNA nucleotides form a stem loop to recognize 
the intron early in splicing. Mol. Cell 38, 416-427 (2010). 

9. Bonnal, S., Vigevani, L. & Valcárcel, J. The spliceosome as a target of novel antitumour 
drugs. Nat. Rev. Drug Discov. 11, 847-859 (2012). 

10. Cretu, C. et al. Molecular architecture of SF3b and structural consequences of its 
cancer-related mutations. Mol. Cell 64, 307-319 (2016). 

11. Kastner, B., Will, C. L., Stark, H. & Lührmann, R. Structural insights into nuclear pre-mRNA 
splicing in higher eukaryotes. Cold Spring Harb. Perspect. Biol. 11, a032417 (2019). 

12. Krämer, A., Grüter, P., Gröning, K. & Kastner, B. Combined biochemical and electron 
microscopic analyses reveal the architecture of the mammalian U2 snRNP. J. Cell Biol. 
145, 1355-1368 (1999). 

13. Bertram, K. et al. Cryo-EM structure of a pre-catalytic human spliceosome primed for 
activation. Cell 170, 701-713.e11 (2017). 

14. Haselbach, D. et al. Structure and conformational dynamics of the human spliceosomal 
B^act complex. Cell 172, 454-464.e11 (2018). 

15. Zhan, X., Yan, C., Zhang, X., Lei, J. & Shi, Y. Structures of the human pre-catalytic 
spliceosome and its precursor spliceosome. Cell Res. 28, 1129-1140 (2018). 

16. Zhang, X. et al. Structure of the human activated spliceosome in three conformational 
states. Cell Res. 28, 307-322 (2018). 

17. Cretu, C. et al. Structural basis of splicing modulation by antitumor macrolide 
compounds. Mol. Cell 70, 265-273.e8 (2018). 

18. Plaschka, C., Lin, P. C. & Nagai, K. Structure of a pre-catalytic spliceosome. Nature 546, 
617-621 (2017). 

19. Papasaikas, P. & Valcárcel, J. The spliceosome: The ultimate RNA chaperone and sculptor. 
Trends Biochem. Sci. 41, 33-45 (2016). 

20. Agafonov, D. E. et al. Semiquantitative proteomic analysis of the human spliceosome via a 
novel two-dimensional gel electrophoresis method. Mol. Cell. Biol. 31, 2667-2682 (2011). 

21. Talkish, J. et al. Cus2 enforces the first ATP-dependent step of splicing by binding to yeast 
SF3b1 through a UHM-ULM interaction. RNA 25, 1020-1037 (2019). 

22. Xu, Y. Z. & Query, C. C. Competition between the ATPase Prp5 and branch region-U2 
snRNA pairing modulates the fidelity of spliceosome assembly. Mol. Cell 28, 838-849 
(2007). 

23. Tang, Q. et al. SF3B1/Hsh155 HEAT motif mutations affect interaction with the 
spliceosomal ATPase Prp5, resulting in altered branch site selectivity in pre-mRNA 
splicing. Genes Dev. 30, 2710-2723 (2016). 

24. Shao, W., Kim, H. S., Cao, Y., Xu, Y. Z. & Query, C. C. A U1-U2 snRNP interaction network 
during intron definition. Mol. Cell. Biol. 32, 470-478 (2012). 

25. Carrocci, T. J., Zoerner, D. M., Paulson, J. C. & Hoskins, A. A. SF3b1 mutations associated 
with myelodysplastic syndromes alter the fidelity of branchsite selection in yeast. Nucleic 
Acids Res. 45, 4837-4852 (2017). 

26. Loerch, S. et al. The pre-mRNA splicing and transcription factor Tat-SF1 is a functional 
partner of the spliceosome SF3b1 subunit via a U2AF homology motif interface. J. Biol. 
Chem. 294, 2892-2902 (2019). 

27. Crisci, A. et al. Mammalian splicing factor SF1 interacts with SURP domains of U2 
snRNP-associated proteins. Nucleic Acids Res. 43, 10456-10473 (2015). 

28. Pan, C. & Russell, R. Roles of DEAD-box proteins in RNA and RNP folding. RNA Biol. 7, 
667-676 (2010). 

29. Liu, Z. et al. Structural basis for recognition of the intron branch site RNA by splicing 
factor 1. Science 294, 1098-1102 (2001). 

30. Liang, W. W. & Cheng, S. C. A novel mechanism for Prp5 function in prespliceosome 
formation and proofreading the branch site sequence. Genes Dev. 29, 81-93 (2015). 


Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in 
published maps and institutional affiliations. 


© The Author(s), under exclusive licence to Springer Nature Limited 2020 


Nature | Vol 583 | 9 July 2020 | 313 




Methods 


Data reporting 

No statistical methods were used to predetermine sample size. The 
experiments were not randomized and investigators were not blinded 
to allocation during experiments and outcome assessment. 


Affinity purification of 17S U2 snRNPs 

HeLa S3 cells were obtained from GBF, Brunswick (currently Helmholtz 
Center for Infection Research) and tested negative for mycoplasma. 
To isolate 17S U2 snRNPs, HeLa nuclear extract was prepared as previously 
described”. Extract was loaded onto a 17-30% (w/v) sucrose gradient 
containing GK5M buffer (20 mM HEPES-KOH, pH 7.9, 150 mM KCl, 
5 mM MgCl2, 0.2 mM EDTA, pH 8.0) and centrifuged in a Beckman Ti-15 
zonal rotor for 40 h at 32,000 rpm. The gradient was subsequently 
fractionated, and those fractions comprising the 17S peak were pooled 
and used for anti-SF3B1 immunoaffinity purification. Under the con- 
ditions used, the 17S peak is located approximately in the middle of 
the gradient, whereas subsequently formed spliceosomal complexes 
that contain U2 sediment in the 30-45S region, which is near the bot- 
tom of the gradient, ensuring that only isolated 17S U2 particles are 
purified. 17S gradient fractions were diluted 1:1 with GK5M buffer 
and loaded onto an anti-SF3B1 affinity column. After washing with 
GK5M buffer containing 1.5% (v/v) sucrose, bound 17S U2 snRNPs 
were eluted with 0.2 mg ml⁻¹ SF3B1 peptide (EQYDPFAEHRPPKIAC) in 
GK5M containing 1.5% sucrose. Eluates were pooled and separated on 
a 5-20% (w/v) sucrose gradient containing GK5M buffer and 0-0.15% 
glutaraldehyde, by centrifuging at 29,000 rpm for 18 h in a Surespin 
630 rotor. The gradient was fractionated from the bottom, and 
glutaraldehyde was quenched by adding 100 mM aspartate (pH 7.0) 
to each fraction. 17S U2 peak fractions were concentrated and the 
buffer exchanged with GK5M buffer, by centrifugation with Amicon 
50 kDa cut-off units. SDS-PAGE and mass spectrometry indicated 
that the protein composition of these affinity-purified 17S U2 snRNPs 
is identical to that of previously described, purified human 17S U2 
snRNPs”’. Owing to the labile nature of the 17S particle and the large 
excess of U2 snRNP in nuclear extract, it was not possible to efficiently 
immunodeplete intact 17S U2 snRNPs from HeLa nuclear extract, 
and assay the function of our purified 17S U2 snRNPs by performing 
in vitro splicing complementation assays. 


BS3 crosslinking of 17S U2 snRNPs and crosslink identification 

After elution from the anti-SF3B1 affinity column, purified 17S U2 particles 
were concentrated to 0.6 µM and subsequently incubated with 60 
µM BS3 for 30 min at 20 °C. After quenching by incubating with 10 mM 
Tris-HCl, pH 8.0, for 15 min at 20 °C, crosslinked 17S U2 particles were 
loaded onto a 15-30% (w/v) glycerol gradient containing G150 buffer 
(20 mM HEPES-KOH, pH 7.9, 150 mM KCl, 1.5 mM MgCl2) and centrifuged 
at 4 °C for 14 h at 35,000 rpm in a TH660 rotor. The gradient was 
fractionated from the bottom, and crosslinked U2 snRNPs migrating in 
the 17S region of the gradient were pelleted by ultracentrifugation in 
an S150AT rotor, and analysed as previously described’. Peptides were 
reverse-phase extracted using Sep-Pak Vac tC18 1cc cartridges (Waters) 
and fractionated by gel filtration on a Superdex Peptide PC3.2/30 col- 
umn (GE Healthcare). Fifty-microlitre fractions corresponding to an 
elution volume of 1.2-1.8 ml were analysed in a Thermo Scientific Q 
Exactive mass spectrometer. Protein-protein crosslinks were identified 
by pLink 1.23 and 2.3.5 search engines (pfind.ict.ac.cn/software/pLink) 
and filtered at an FDR of 1% or 5% according to the recommendations 
of the developer”. For simplicity, the crosslink score is represented 
as a negative value of the common logarithm of the original pLink 
score; that is, score = -log10(‘pLink Score’). An expected maximum 
distance between the Cα atoms of the BS3-crosslinked lysine residues 
is approximately 30 Å. The length of most crosslinks (93-95% at the 
spectral level and 85-86% at the unique crosslink level) is <30 Å in the 


presented model of the 17S U2 snRNP (see Extended Data Fig. 3). The 
CXMS data from Supplementary Table 1 can be downloaded and visualized 
by interactive 2D viewers such as xiNET (http://crosslinkviewer.org/) 
or https://xvis.genzentrum.lmu.de/login.php, or in 3D using UCSF 
Chimera or PyMOL. 
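
The score transformation and the distance check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the analysis pipeline used for the study: the function names, the example score and the example coordinates are hypothetical and are not taken from Supplementary Table 1.

```python
import math

# Approximate maximum Cα-Cα distance for a BS3 crosslink, as stated in the text (Å).
BS3_MAX_CA_DISTANCE = 30.0


def crosslink_score(plink_score: float) -> float:
    """Return -log10 of a raw pLink score, so that better hits get larger values."""
    if plink_score <= 0:
        raise ValueError("pLink score must be positive")
    return -math.log10(plink_score)


def ca_distance(a, b) -> float:
    """Euclidean distance (Å) between two Cα coordinates given as (x, y, z)."""
    return math.dist(a, b)


def fraction_satisfied(pairs, cutoff: float = BS3_MAX_CA_DISTANCE) -> float:
    """Fraction of crosslinked Cα pairs whose distance is within the cutoff."""
    if not pairs:
        raise ValueError("no crosslinks supplied")
    satisfied = sum(1 for a, b in pairs if ca_distance(a, b) <= cutoff)
    return satisfied / len(pairs)


# Hypothetical coordinates, for illustration only.
example_pairs = [
    ((0.0, 0.0, 0.0), (10.0, 0.0, 0.0)),  # 10 Å apart: within the limit
    ((0.0, 0.0, 0.0), (0.0, 40.0, 0.0)),  # 40 Å apart: exceeds the limit
]

print(crosslink_score(1e-5))
print(fraction_satisfied(example_pairs))
```

In practice the coordinate pairs would come from the Cα atoms of the crosslinked lysines in the fitted model, and the resulting fraction is the kind of figure quoted above for the spectral and unique-crosslink levels.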


Electron-microscopy sample preparation and image acquisition 
The purified 17S U2 sample was adsorbed onto a thin-layer carbon film for 
1 min, which was subsequently attached to R3.5/1 QUANTIFOIL grids. 
Four microlitres of ddH2O was applied to the grid and excess water was 
blotted away by an FEI Vitrobot loaded with pre-wet filter paper, with 
the setting of blotting force 8, blotting time 4 s, at 100% humidity and 
4 °C, and the grid was then vitrified by plunging into liquid ethane. Cryo-EM data 
were acquired on an FEI Titan Krios electron microscope (Thermo Fisher 
Scientific), equipped with a Cs corrector, at 300 kV. The images were 
recorded in integration mode on a Falcon III direct electron detector 
at 120,700× magnification, which corresponds to a calibrated pixel 
size of 1.16 Å at the specimen level. Micrographs were recorded using 
an exposure time of 1 s and a total dose of 72 e⁻ per Å². In total, 12,194 
micrographs were recorded with 20 movie frames and another 28,226 micrographs 
were recorded with 39 frames. 
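
As a consistency check on these acquisition parameters, the relationship between magnification and specimen-level pixel size can be sketched as follows. The 14 µm physical pixel size of the detector is an assumption that is not stated in the text, and the function names are illustrative only.

```python
ANGSTROM_PER_MICROMETRE = 1e4


def specimen_pixel_size(detector_pixel_um: float, magnification: float) -> float:
    """Pixel size at the specimen level (Å) for a given detector pixel and magnification."""
    return detector_pixel_um * ANGSTROM_PER_MICROMETRE / magnification


def electrons_per_pixel(dose_per_a2: float, pixel_size_a: float) -> float:
    """Total electrons landing on one pixel for a dose given in e- per Å^2."""
    return dose_per_a2 * pixel_size_a ** 2


# Assumption: 14 µm physical detector pixel (not stated in the text).
pixel = specimen_pixel_size(14.0, 120_700)
print(round(pixel, 2))

# Values from the text: 72 e- per Å^2 total dose, 1.16 Å calibrated pixel size.
print(round(electrons_per_pixel(72.0, 1.16), 1))

# Micrograph counts from the text for the two frame settings.
total_micrographs = 12_194 + 28_226
print(total_micrographs)
```

Under that assumption the computed pixel size agrees with the calibrated 1.16 Å, and the two micrograph counts add up to the 40,420 summed micrographs referred to in the image-processing section.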


Image processing 

Frames were dose-weighted, aligned and summed with Motion- 
Cor2**. The defocus values of the micrographs were determined 
by Gctf*®. Summed micrographs were manually evaluated in the 
COW-MicrographQualityChecker (http://www.cow-em.de). Micrographs 
with isotropic Thon rings and clear particles were processed 
further. In total, 27,890 out of 40,420 summed micrographs were 
retained for further processing. Initially, around 3.5 million particles 
were automatically picked using Gautomatch (http://www.mrc-lmb.cam.ac.uk/kzhang), 
then extracted with a box size of 360 x 360 pixels, 
and binned to 180 x 180 pixels (pixel size of 2.32 Å) in RELION 3.0 
(http://www2.mrc-lmb.cam.ac.uk/relion/index.php/Main_Page). Several 
iterations of reference-free 2D classification were performed in 
RELION 3.0 and ‘bad classes’ showing fuzzy or uninterpretable features 
were removed, yielding around 1.43 million ‘good particles’. A subset 
of 200,000 particles was used to generate an initial 3D map using the ab initio 
reconstruction function in cryoSPARC”*. The 3’ domain and other 
peripheral parts of this map were erased in UCSF Chimera v.1.13.1. 
The remaining 5’ domain was low-pass filtered to 40 Å and used as the 
3D reference for 3D classification in RELION 3.0. The approximately 
1.43 million good particles from 2D classification were split into five 
datasets. Each dataset was 3D classified into three classes. All 612,445 
particles from good 3D classes (classes that show a complete particle 
with fine details) were combined and subjected to the Refine3D function 
in RELION 3.0, with a mask around the 5’ domain, resulting in a 
5.7 Å resolution map. Next, using the alignment parameters from the 
aforementioned masked 3D refinement, the 612,445 particles were 
subjected to focused classification, with a mask around the 5’ domain, into 8 classes. 
The best class (with 152,078 particles) that showed clear separation of 
HEAT repeats was selected, re-extracted in original pixel size (1.16 Å per 
pixel) with the box size of 300 x 300 pixels, and subjected to another 
round of focused 3D classification. The best 3D class (with 120,070 
particles) was then selected and refined by Refine3D in RELION 3.0 
with a mask around the entire 17S U2 particle or the 5’ domain. The final 
reconstruction of the entire particle has an average resolution of 7.1 Å 
and the reconstruction with a mask around the 5’ domain has an average 
resolution of 4.1 Å, based on the RELION gold-standard Fourier shell 
correlation. The data were acquired at two different settings (see above). 
Because higher averaged doses tend to result in poorer reconstruction, 
we checked whether removing the particles from the micrographs 
receiving a higher average dose (leading to 82,420 particles retained) 
would improve the numerical resolution and the map quality, but this 
was not the case (data not shown). 
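
The binning arithmetic and the particle-selection funnel described in this section can be summarized with a short Python sketch. It uses only numbers quoted above (the picked and 2D-classified counts are approximate in the text); the function names and stage labels are illustrative and are not part of RELION or any other package.

```python
def binned_pixel_size(pixel_size: float, box: int, binned_box: int) -> float:
    """Pixel size (Å) after Fourier-cropping a box from `box` to `binned_box` pixels."""
    if binned_box <= 0 or box % binned_box != 0:
        raise ValueError("box must be an integer multiple of binned_box")
    return pixel_size * (box // binned_box)


def nyquist_limit(pixel_size: float) -> float:
    """Best resolution (Å) that a given pixel size can represent."""
    return 2.0 * pixel_size


# Particle counts at each stage, as quoted in the text.
FUNNEL = [
    ("picked (approximate)", 3_500_000),
    ("after 2D classification (approximate)", 1_430_000),
    ("after 3D classification", 612_445),
    ("best focused class", 152_078),
    ("final class", 120_070),
]

binned = binned_pixel_size(1.16, 360, 180)
print(round(binned, 2), round(nyquist_limit(binned), 2))
print(round(nyquist_limit(1.16), 2))

for (_, previous), (label, count) in zip(FUNNEL, FUNNEL[1:]):
    print(f"{label}: {count} ({100 * count / previous:.1f}% of previous stage)")
```

The Nyquist limit of the binned data (4.64 Å) lies above the final 4.1 Å resolution of the 5’ domain, which is why re-extraction at the original 1.16 Å pixel size was needed before the final refinement.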


Model building and refinement 

Templates for the U2 proteins and RNA were obtained, wherever pos- 
sible from published structures (Extended Data Table 2). The SF3b core 
crystal structure, together with parts of SF3A3 (amino acids 392-499) 
and SF3B2 (amino acids 458-530, 565-598) were initially docked as 
rigid bodies into the 4.1 Å map of the U2 5’ domain. WD40-B of SF3B3 
was then manually adjusted to fit the map using Coot v. 0.8.9.2°°. The 
central part of the SF3b core has a resolution ranging from 3.6 to 4.5 Å, which 
allowed manual adjustment of side chains. The other parts were locally 
adjusted, without manipulating secondary structures or side chains. 
The model of the stem of U2 SLI (nucleotides 12-14, 19-21) was built 
by UCSF Chimera v. 1.13.1” as an A-form RNA helix. The model of BSL 
(nucleotides 25-45) was predicted by RNAvista”’ with base-pairing 
restraints as proposed in a previous study for S. cerevisiae U2 snRNA®. 
U2 SLIIa (nucleotides 48-65) was modelled based on the SLIIa in the 
yeast B^act complex, and adjusted to the human U2 snRNA sequence. 
These RNA models were docked into the map individually as rigid bod- 
ies and connections between the SLs were de novo built using Coot. 
We also built homology models for PRP5 (amino acids 146-195) and 
TAT-SF1^RRM1 (amino acids 127-220), using the SWISS-MODEL suite” or 
the SpliProt3D database". Guided by our crosslinking (CXMS) data, 
we then rigidly docked the models into the high-resolution map of the 
5’ domain. The better resolution (approximately 5 Å) of the TAT-SF1^RRM1 
region allowed local adjustment of the homology model. Models of the 
entire 5’ domain were then combined and subjected to real space refine- 
ment in PHENIX™”, with secondary structure restraints. The models of 
the remaining proteins or RNA (Extended Data Table 2) were rigid body 
docked into the low-resolution map of the entire 17S U2 particle, based 
on CXMS data and the overall shape of the EM density. 


Reporting summary 
Further information on research design is available in the Nature 
Research Reporting Summary linked to this paper. 


Data availability 


The atomic coordinate files have been deposited in the Protein Data 
Bank (PDB) with the following accession codes: U2 5’ domain (6Y50), 
low resolution region (6Y53) and entire 17S U2 particle (6Y5Q). The 
cryo-EM maps have been deposited in the Electron Microscopy Data 
Bank as follows: U2 5’ domain (EMD-10688) and entire U2 particle 
(EMD-10689). 
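
For readers who wish to retrieve these depositions programmatically, a minimal Python sketch is given below. The accession codes are those listed above; the URL layouts for the RCSB file server and the EMDB entry pages are assumptions about those services rather than part of this paper, and the helper names are hypothetical. The sketch only builds the addresses and does not perform any download.

```python
# Accession codes as listed in the Data availability statement.
PDB_ENTRIES = {
    "U2 5' domain": "6Y50",
    "low resolution region": "6Y53",
    "entire 17S U2 particle": "6Y5Q",
}
EMDB_ENTRIES = {
    "U2 5' domain": "EMD-10688",
    "entire U2 particle": "EMD-10689",
}


def pdb_download_url(code: str) -> str:
    """Build an mmCIF download address for a 4-character PDB code (assumed URL layout)."""
    code = code.strip().upper()
    if len(code) != 4 or not code.isalnum():
        raise ValueError(f"not a valid PDB code: {code!r}")
    return f"https://files.rcsb.org/download/{code}.cif"


def emdb_entry_url(accession: str) -> str:
    """Build an EMDB entry page address for an 'EMD-nnnnn' accession (assumed URL layout)."""
    accession = accession.strip().upper()
    if not accession.startswith("EMD-") or not accession[4:].isdigit():
        raise ValueError(f"not a valid EMDB accession: {accession!r}")
    return f"https://www.ebi.ac.uk/emdb/{accession}"


for label, code in PDB_ENTRIES.items():
    print(label, pdb_download_url(code))
for label, accession in EMDB_ENTRIES.items():
    print(label, emdb_entry_url(accession))
```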
Extended Data Fig. 1 | Spliceosome assembly cycle and the interaction of 
17S U2 with the pre-mRNA branch site. a, Early assembly and catalytic 
activation pathway of the spliceosome. The 17S U2 snRNP, the structure of 
which was determined here by cryo-EM, is indicated by an asterisk. For 
simplicity, the stepwise interactions of the U1, U2, U4/U6 and U5 snRNPs, and 
only selected non-snRNP proteins are shown. Helicases involved in conversion 
of the E to A complex are indicated. In the E complex, the U2AF65 UHM interacts 
with the ULM of SF1, and after release of SF1, it subsequently interacts with a 
ULM in the N-terminal region of SF3B1. This swap of UHM-ULM interactions is 
probably very important for positioning the BS before the conformational 
change in the HEAT domain clamps down on the U2-BS helix and stabilizes the 
U2 snRNP interaction with the intron. The U2AF65/U2AF35 dimer is released 
(not shown) during conversion of the A to B complex”. SF1 is displaced from 
the BS by UAP56 (either before or after PRP5 action). SF1 pre-bulges the BS-A via 
accommodation of the latter in its KH domain, facilitating subsequent 
base-pairing of U2 with the BS”. b, Base-pairing interactions between U2 
snRNA and the BS and upstream intron nucleotides, that lead to bulging of the 
BS-A. The sequence shown is from intron 10 of the pre-mRNA for the 
polypyrimidine tract-binding protein (PTB). Red shading denotes the bona fide 
U2-BS helix, and shows base pairs formed between human U2 snRNA and the 
conserved BS consensus sequence of PTB intron 10. Yellow shading denotes 
the extended U2-BS helix, in which the number and nature of base-pairing 
interactions varies depending on the pre-mRNA intron sequence. c, Schematic 
of the composition of the human 17S U2 snRNP. Only abundant U2 proteins are 
shown?°. d, Open structure of the SF3B1 HEAT domain in the isolated SF3b 
complex (left) and its more closed conformation after interaction with the 
U2-BS helix (right). The SF3B1 HEAT domain (green) forms a super-helical 
structure, and in the spliceosome sequesters the U2-BS helix. The 
conformational change in the HEAT domain, which is required to form the BS-A 
binding pocket, was proposed to occur after formation of the U2-BS helix”. 
Before this study, the conformation of SF3B1 in human 17S U2 snRNP was not 
known. The pre-mRNA is in grey; U2 snRNA is coloured as in Fig. 2. For 
simplicity, the PHF5A protein that also forms part of the BS-A binding pocket is 
not shown. The structures of SF3B1^HEAT in the isolated SF3b complex (PDB code 
5IFE) (left) and human B^act complex (PDB code 6FF4) (right) are aligned via 
HEAT repeat 20. 
Extended Data Fig. 2 | Cryo-EM and image-processing of the human 17S U2
snRNP. a, Computational sorting scheme. All major image-processing steps are
depicted. For a more detailed explanation, see ‘Image processing’ in
the Methods. A considerable amount of conformational heterogeneity is
present in all spliceosomal complexes but even more in the bipartite 17S U2
snRNP, which is structurally very labile and readily dissociates during
purification, making its analysis by electron microscopy challenging. In
addition, the bridges connecting the 5’ and 3’ domains of the U2 particle (see
also Extended Data Fig. 5) have a very flexible character, leading to flexibility in
the 3’ domain and the large variation in local resolution. Thus, a substantially
higher number of particles was needed to generate the 17S U2 structure than is
usually used for cryo-electron microscopy. b, Typical cryo-electron
micrograph of the Homo sapiens 17S U2 snRNP recorded at 120,700x
magnification with a Titan Krios microscope using a Falcon III direct electron
detector operating in integration mode at a calibrated pixel size of 1.16 Å. c,
Representative selection of reference-free 2D class average images depicting
17S U2 particles recorded under cryo conditions. d, Euler angle distribution
plot of all 17S U2 particles that contributed to the final structure. Red depicts a
higher relative number of particles at a certain angle. The generally uniform
distribution of the particle projection angles ensures an isotropic 3D
electron-microscopy density map. e, Local resolution estimation of the 5’
domain of 17S U2 snRNP. The 5’ domain shows a resolution between 3.6 and
9.0 Å. The map of the remaining part excluding the 5’ domain, shown as a
translucent overlay, was determined at resolutions between 10 and 30 Å.
f, Fourier shell correlation (FSC) of two independently refined half datasets,
calculated using the ‘PostProcessing’ routine in RELION, indicates a global
resolution of 7.1 Å for the entire 17S U2 snRNP, and 4.1 Å for the masked U2 5’
domain. Multibody refinement around the 3’ domain and the peripheral parts
did not produce better resolved maps for those regions. g, Map versus model
FSC curves for the 5’ domain and the SF3b core. The FSC=0.5 criterion
indicates a resolution of 4.2 Å for the U2 5’ domain, and 4.15 Å for the SF3b core.
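The resolution values quoted in f and g are read off an FSC curve at the point where it first drops below a chosen threshold (0.5 is the criterion stated for the map-versus-model curves in g). The following is a minimal sketch of that read-out only. The function name `resolution_at_threshold` and the example curve values are hypothetical illustrations, not data from this study, and this is not the RELION ‘PostProcessing’ routine.

```python
def resolution_at_threshold(freqs, fsc, threshold):
    """Return the resolution (in Å) at which an FSC curve first falls below `threshold`.

    freqs: spatial frequencies in 1/Å, in ascending order.
    fsc:   FSC value for each frequency shell.
    The crossing is located by linear interpolation between the two shells
    that bracket it. Returns None if the curve never drops below the threshold.
    """
    for i in range(1, len(freqs)):
        above, below = fsc[i - 1], fsc[i]
        if above >= threshold > below:
            frac = (above - threshold) / (above - below)
            f_cross = freqs[i - 1] + frac * (freqs[i] - freqs[i - 1])
            return 1.0 / f_cross  # resolution is the reciprocal of spatial frequency
    return None


# Made-up curve for illustration only (assumption, not measured values).
example_freqs = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30]
example_fsc = [1.00, 0.98, 0.90, 0.70, 0.30, 0.05]
print(resolution_at_threshold(example_freqs, example_fsc, 0.5))
```

Different thresholds applied to the same curve give different resolution numbers, which is why the legend names the criterion alongside each value.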
Extended Data Fig. 3 | Euclidean distances for crosslinks observed between
modelled residues (Cα) of the 17S U2 snRNP. a-c, Crosslinks from a single 17S
U2 crosslinking experiment (with two technical replicates) were identified by
pLink 2.3.5 and filtered to a false discovery rate (FDR) of 1% (a), pLink 1.23 at an
FDR of 1% (b), and pLink 1.23 at an FDR of 5% (c). Calculations were performed
using PyMOL 2.3.4 for crosslinks with a score of at least 1. Most crosslinks
(93-95% at the spectral level and 85-86% at the unique crosslink level) are
consistent (that is, Cα atoms of the crosslinked amino acids are within 30 Å of
each other) in the presented model of the 17S U2 snRNP. The percentage of
overlength crosslinks (that is, longer than 30 Å) is slightly higher than observed
for more rigid complexes, which is consistent with the known structural
flexibility/dynamics of the 17S U2 snRNP.
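The consistency check described in this legend amounts to measuring the Euclidean distance between the Cα atoms of each crosslinked pair and comparing it with the 30 Å cutoff. The sketch below illustrates that bookkeeping under stated assumptions: the function names and the example coordinates are hypothetical, a distance of exactly 30 Å is treated as consistent, and the code is not the PyMOL procedure used for the figure.

```python
import math

MAX_CA_DISTANCE = 30.0  # Å, the cutoff named in the legend


def ca_distance(coord_a, coord_b):
    """Euclidean distance between two Cα positions given as (x, y, z) in Å."""
    return math.dist(coord_a, coord_b)


def summarize_crosslinks(pairs, cutoff=MAX_CA_DISTANCE):
    """Return (consistent, overlength, percent_consistent) for Cα coordinate pairs.

    A crosslink counts as consistent when its Cα-Cα distance is <= cutoff
    (inclusive boundary is an assumption of this sketch).
    """
    distances = [ca_distance(a, b) for a, b in pairs]
    consistent = sum(1 for d in distances if d <= cutoff)
    overlength = len(distances) - consistent
    percent = 100.0 * consistent / len(distances) if distances else 0.0
    return consistent, overlength, percent


# Hypothetical coordinates for illustration only.
example_pairs = [
    ((0.0, 0.0, 0.0), (3.0, 4.0, 0.0)),   # 5 Å apart
    ((0.0, 0.0, 0.0), (0.0, 0.0, 30.0)),  # exactly at the cutoff
    ((0.0, 0.0, 0.0), (40.0, 0.0, 0.0)),  # overlength
]
print(summarize_crosslinks(example_pairs))
```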
Extended Data Fig. 4 | The SF3B1 HEAT domain has an open conformation in
the 17S U2 snRNP. a, Fit of the SF3b core proteins into the 17S U2
electron-microscopy density (grey). b, Overlay of the HEAT domain (amino acids
529-1201) of SF3B1 in 17S U2 snRNP (green) and the crystal structure of the isolated
SF3b core (gold, PDB 5IFE). In 17S U2 and isolated SF3b, PHF5A contacts both
N- and C-terminal regions of the HEAT domain, interacting with HEAT repeats
HR2-HR3 near its N terminus, as well as HR15, HR17 and HR18 near its
C terminus. PHF5A thus contacts two previously described, dynamic hinge
regions (HR3-HR4 and HR15-HR16) of the HEAT domain, and thereby helps to
stabilize the SF3B1 open conformation. c, Close up of SF3B1 HR16 in 17S U2
overlaid with that in isolated SF3b. HR16 is completely structured in the 17S U2
particle, but not in isolated SF3b. d, Overlay of the SF3b core domain in 17S U2
snRNP (green) with the crystal structure of isolated SF3b (gold; PDB code 5IFE).
For clarity, the SF3B3 protein is coloured red-orange in this panel. Although the
WD40-A and WD40-C domains of SF3B3 have essentially the same
conformation, and clamp SF3B5 in a similar manner in both the 17S U2 snRNP
and the SF3b core complex, the WD40-B domain has a slightly different
position and is rotated more towards the SF3B1 HEAT domain in 17S U2.
e, Multiple crosslinks
were detected between SF3B6, which can be crosslinked to the BS-A in
spliceosomal A complexes, and SF3B1 residues on the upper surface of the
HEAT domain, as well as the PRP5 α-helix that interacts with HR9-HR12, and
PHF5A (see also Supplementary Table 1). Crosslinks between SF3B1 and SF3B6
were also detected in the N-terminal region of SF3B1 located at, or near, amino
acids 373-415, which are required for stable SF3B6-SF3B1 interaction.
Similar protein-protein crosslinks involving SF3B6 were observed with
recombinant, intact SF3b complexes, which indicates that SF3B6 is located in
a similar, but not firmly fixed, position both in 17S U2 and the isolated SF3b
complex. Numbers (colour-coded to match protein colours) indicate the
positions of crosslinked lysine residues, which are connected by black arrows.
The distances between the crosslinked residues in our 17S U2 model are
indicated by small numbers next to the black arrows. A distance is not included
if one of the crosslinked residues is present in an unstructured protein region.
f, The site where the splicing modulator pladienolide B (PlaB) binds SF3B1 is not
occupied in the 17S U2 snRNP. The crystal structure of PlaB bound to the SF3b
core complex showed that the binding pocket of PlaB, which is formed by
HR15-HR17 and PHF5A, is present only in the open conformation of the HEAT
domain, and overlaps with the BS-A-binding pocket. As the PlaB-binding
pocket is present in a hinge region of the HEAT domain, PlaB was proposed to
inhibit SF3b function by preventing the conformational change in the
HEAT domain needed to clamp down on the U2-BS helix. There is no
electron-microscopy density observed in the PlaB-binding pocket and thus PlaB
and potentially other splicing modulators can also bind SF3B1 in the 17S U2
snRNP.
Extended Data Fig. 5 | The 3’ domain of the 17S U2 snRNP and molecular
bridges connecting it to SF3b. a, The U2 3’ domain is connected to the 5’
domain by three main bridges. Fit of the entire 17S U2 molecular model into the
electron-microscopy density (low-pass filtered). b, Fit of the U2 Sm core, U2
snRNA SLIII, and U2-A′ and U2-B″ bound to U2 snRNA SLIV. The overall structure
of the U2 3’ domain does not change substantially after U2 incorporation into
the spliceosome. The U2 Sm core domain is located at a similar distance from
the SF3B1 HEAT domain as observed in the human B complex. The resolution of
the 17S U2 present in the yeast A complex, as well as in the human pre-B complex, is
not sufficient to make meaningful, detailed comparisons of their structure
with that of our 17S U2 snRNP. Furthermore, in the former complexes the
molecular architecture of U2 is derived entirely from that found in yeast B or
human B^act complexes. c, Bridge 1 is probably composed of U2 snRNA
nucleotides upstream of the Sm-binding site that connect it to SLIIb, which is
also part of this bridge, as well as unassigned protein density. d, Bridge 2 is
formed by RRM2 of U2-B″ and amino acids in the C-terminal half of SF3A3 that
bridge U2-B″ and the WD40-C domain of SF3B3. e, The N-terminal helical
domain of SF3A3 contacts the U2 Sm core. SF3A3 also interacts with SF3B2 and
then extends to the U2 snRNA SLIIa and BSL (see Fig. 2). Rigid-body fitting
combined with protein-protein crosslinking (see f and g) allowed us to localize
near or within bridge 3, both RRM1 (amino acids 13-91) and RRM2 (amino acids
100-179) of the SF3B4 protein. SF3B4 RRM1 interacts with a short region of its
binding partner SF3B2 (amino acids 607-693) and SF3B4 RRM2 extends towards
the β-sandwich of SF3A2 (amino acids 118-209) and a long helical region of
SF3A1 (amino acids 235-274) that also comprises part of bridge 3. SF3A1
extends from the N-terminal region of SF3A3 to the SF3A2 β-sandwich. Thus,
SF3a proteins have important bridging roles in connecting the 3’ and 5’
domains of the U2 snRNP. Stable integration of SF3a into the human 17S U2
particle during its biogenesis requires the prior binding of SF3b and thus the
SF3a-SF3b protein contacts described above potentially have key roles in the
assembly of the 17S U2 particle. f, g, Intermolecular crosslinks supporting the
location in our 17S U2 model of SF3B4 RRM2 (f) and SF3B4 RRM1 (g). Numbers
(colour-coded to match protein colours) indicate the positions of crosslinked
lysine residues, which are connected by black arrows. The distance between
the crosslinked residues in our 17S U2 model is indicated by small numbers next
to the black arrows. A distance is not included if one of the crosslinked residues
is present in an unstructured protein region.
Extended Data Fig. 6 | Fit of the various stem-loops in the 5’ half of U2
snRNA. a, b, Alternative conformations of stem-loops potentially formed in
the 5’ part of human U2 snRNA. The stem of the BSL could potentially form
additional base pairs, but the presence of the SF3A3 separator helix (amino
acids Y392 to H400) prevents base-pair formation beyond G25-C45. This
guarantees that U46 and U47 are single-stranded and thus could potentially be
involved in the proposed movement of the BSL away from SF3B1 during U2-BS
helix formation (see also Extended Data Fig. 8). c, Fit of U2 SLIIa, BSL and SLI
three-way junction into 17S U2 electron-microscopy density. d-f, Fit of the
individual SLIIa (d), BSL (e) and a shortened SLI (f) into the electron-microscopy
density. Owing to the formation of the BSL, only a shortened U2 SLI can form, as
nucleotides in the lower stem of an extended SLI would instead form base pairs
located in the lower stem of the BSL. Thus, an extended SLI and a BSL are
mutually exclusive, competing U2 snRNA conformations. The position of
the remaining 10 nucleotides at the 5’ end of the U2 snRNA, which in the
spliceosome form part of U2-U6 helix II, cannot be discerned. g, Fit of the
SF3A3 separator helix into density at the base of the BSL. h, Similar SLIIa RNP
architecture in 17S U2 snRNP, and the yeast B and human B^act spliceosomal
complexes. In human U2 snRNP, loop nucleotides of SLIIa contact amino acids
of the loop connecting the two α-helices of SF3B1 HR20, as well as residues of
SF3B2; two regions of the latter (amino acids 458-530 and 565-598) are located
in well-resolved density close to SLIIa and SLI, respectively. These SLIIa
contacts are similar to those found in yeast B and human B^act spliceosomes.
Thus, they are a major, direct anchor point for SF3b on the U2 snRNA. The poor
resolution of the cryo-EM structure of the human B complex in this region does
not allow for an accurate comparison.
Extended Data Fig. 7 | Fit of PRP5 and TAT-SF1 and their interaction with the
SF3B1 HEAT domain. a, Schematic of the domain organization of human PRP5
(left) and TAT-SF1 (right), with amino acid boundaries of each domain indicated
below. b, Fit of the PRP5 RecA1 and RecA2 domains in an open conformation
into the 17S U2 electron-microscopy density. The open (inactive) conformation
of the PRP5 RecA domains seems to fit better than the closed (active)
conformation, which suggests that PRP5 is inactive in 17S U2 snRNP. However,
neither the resolution in this region nor intramolecular PRP5 crosslinks allow
us to confidently position these two domains relative to each other and thereby
distinguish between these two conformations. DEAD-box proteins typically
bind double-stranded RNA via their RecA domains and facilitate local strand
separation by introducing one or two sharp bends in one of the bound strands
that prevent base-pairing with the complementary strand. The resolution of
the 17S U2 in the region where the PRP5 RecA domains are located does not
allow us to discern whether PRP5 is interacting with RNA at this stage. On the
basis of our structure, the RecA domains of PRP5 in 17S U2 are not close enough
to the BSL to directly disrupt it, and also are not located close enough to the SLI.
c, Protein crosslinks support the positioning of the PRP5 RecA domains and the
TAT-SF1 UHM. Crosslinks are observed between the TAT-SF1 UHM and residues in the
N- and C-terminal HEAT repeats of SF3B1, in the region of PRP5 between the
N-terminal α-helix and RecA domains, and in PHF5A, which supports the
location of the TAT-SF1 UHM in the less-well resolved density element adjacent to the
U2 BSL. d, Fit of the PRP5 α-helix (amino acids 146-196) into the
electron-microscopy density contacting HEAT repeats 9-12. e, Protein crosslinks
between PRP5 and SF3B1 suggest that a region spanning approximately 200
amino acids located N-terminal of the PRP5 helicase domain wraps around
most of the SF3B1 HEAT domain. These data are consistent with studies showing
that yeast Prp5 also interacts with HR1-HR6 and HR9-HR12 of the yeast SF3B1
homologue, Hsh155. Numbers (colour-coded to match protein colours)
indicate the positions of crosslinked lysine residues, which are connected by
black arrows. The proposed path of unstructured regions of PRP5 is indicated
by a dotted line and the location of the conserved PRP5 DPLD motif is shown.
The positions of selected cancer-related hotspot mutations of SF3B1 (K666,
K662, K700 and G742) are indicated. Point mutations in the SF3B1 HEAT domain
are linked to
various cancers and lead to the utilization of cryptic branch sites and 3’ splice
sites in vivo. The exact mechanism responsible for these changes in
alternative splicing is currently unknown but it has been suggested that these
mutations may affect the curvature of the HEAT domain. Studies in yeast
showed that position-equivalent point mutations in SF3B1 that are linked to
cancer in humans lead to loss of stable Prp5 binding to the U2 snRNP. More
recent studies indicate that in humans these cancer-related mutations
destabilize the interaction of SF3B1 with the SUGP1 protein, which, however, is
essentially absent from our 17S U2 preparations. As PRP5 appears to
encompass the entire HEAT domain in the human 17S U2, it is conceivable that a
change in the curvature of the HEAT domain could destabilize the SF3B1-PRP5
interaction, and the absence of PRP5 may directly lead to alterations in BS
selection by U2 containing SF3B1 cancer-related point mutations. Prp5 was
shown to contact the SF3b complex in the U2 snRNP via a conserved DPLD
motif in its N-terminal domain. The most common (hotspot) cancer-related
point mutations mainly cluster in or near HR6 and notably our crosslinking
data place the DPLD motif of PRP5 (amino acids 226-229) in proximity to HR6.
Thus cancer-related hotspot mutations may disrupt this essential interaction,
leading to destabilization of PRP5. This, in turn, could have a detrimental effect
on the function of PRP5 to facilitate the formation of a stable U2-BS interaction
and/or on the proofreading activity of PRP5 (as discussed above), and thus
facilitate the usage of aberrant BS and 3’ splice sites. f, Protein crosslinks
between TAT-SF1 RRM1 and neighbouring proteins. In 17S U2, TAT-SF1 RRM1 is
located adjacent to SF3B1 HR15 and HR16 (Fig. 3a), and thus may also stabilize
HR16, which is completely structured in 17S U2 (Extended Data Fig. 4c).
Crosslinking also suggests that an unstructured loop of SF3B2 consisting of
amino acids 531-564 may occupy the electron-microscopy density near SF3B1
HR12-HR15. g, Position of SF3B1 and SF3B2 in 17S U2 (left) and in the human B^act
complex (right). In 17S U2, TAT-SF1 RRM1 is located in the same position where two
α-helices of SF3B2 are found in the B^act complex, and thus release of TAT-SF1
would be required for the subsequent formation or repositioning of this region
of SF3B2. h, Fit of TAT-SF1 UHM (amino acids 260-353) into the 17S U2
electron-microscopy density. Amino acids 260-353 of human TAT-SF1 were initially
designated RRM2, but were later shown to comprise a UHM.
Extended Data Fig. 8 | RNP rearrangements and movements of the U2
snRNA required for formation of the extended U2-BS helix. a, Spatial
orientation of the BSL and neighbouring proteins in the 17S U2 (left), and the
subsequently rearranged U2 snRNA and U2 proteins after formation of the U2/BS
helix and stable U2 incorporation into the spliceosome (right). Although the
human B^act structure (PDB code 6FF4) is shown, the spatial organization of the
shown U2 components is similar in B and presumably also human A complexes.
b, Movement of U2 SLI behind the BSL requires repositioning of the latter.
Coloured surfaces are derived from the fitted protein models, with the RNA
depicted as a combination of a transparent surface model and RNA helix. The
movement of the 5’ end of U2 required for formation of the extended U2-BS
helix would require release of the BSL from SF3B1. In the absence of the latter,
the 5’ end of U2 snRNA, including the short SLI, which is topologically located
above the BSL stem, would have to be threaded through a very narrow opening
between the BSL and the SF3B1 HEAT domain in order to unwind the BSL, a
scenario that is unlikely. Thus, the remodelling of the U2 BSL into an extended
U2-BS helix will probably occur in a conformational state of the U2 snRNP in
which the BSL is moved away from the C-terminal HEAT repeats. This
movement could involve a rotation around U2 nucleotides U46 and/or U47,
which link the BSL to the stem of SLIIa and appear to be maintained in a
single-stranded conformation by SF3A3 (Extended Data Fig. 6). In this respect,
it is intriguing that the short SF3A3 separator helix is situated at the same place
in 17S U2 and in human B^act complexes. Moreover, it probably has very similar
roles in 17S U2 and the spliceosome. That is, in B^act, the SF3A3 separator helix
also determines the length of the extended part of the U2-BS helix and
facilitates the movement of U2 snRNA and the intron away from each other. The
SF3A3 separator helix probably maintains its contact with the U2 SLIIa,
ensuring the stable interaction of the SF3b complex with the U2 snRNA. This
has the advantage that the newly formed extended U2-BS complex can swing
back towards the HEAT domain with SF3A3 still bound to the SLIIa. Docking of
the extended U2-BS helix to the SF3B1 C-terminal HEAT repeats is potentially
facilitated by initial interactions with the backbone of the first 5-6 nucleotides
of the intron downstream of the BS. c, Movements of the 5’ end of U2 that
enable formation of the extended U2-BS helix. Step 1: ATP hydrolysis by PRP5
disrupts protein contacts with the BSL and presumably also with the 5’ end of
the U2 snRNA including SLI. This allows base pairing between the BSL loop
nucleotides and the BS of the pre-mRNA intron. Step 2: the destabilized BSL
unwinds and the 5’ end of U2 moves behind the BSL and downward, with a
twisting motion that repositions the pre-mRNA behind the unwound BSL.
Movement of the 5’ end downward allows the formation of additional base pairs
with the pre-mRNA. As both ends of the pre-mRNA intron appear to be fixed by
protein and snRNP interactions, formation of the helical conformation of the
U2-BS and extended U2-BS probably involves first movement of the 5’ end of
U2 behind the pre-mRNA (step 4), followed by a movement across the
pre-mRNA (that is, a twisting rotation of the 5’ end of U2 around the pre-mRNA)
(step 5). Numbers indicate the position of selected U2 nucleotides.


Extended Data Table 1| Cryo-EM data collection, refinement and validation statistics 


Data collection and processing 
Magnification 
Voltage (kV) 
Electron exposure (e−/Å²)
Defocus range (µm)
Pixel size (Å)
Symmetry imposed
Initial particle images (no.)
Final particle images (no.)
Map resolution (Å)
FSC threshold
Map resolution range (Å)
5’ domain
TAT-SF1 RRM1
TAT-SF1 UHM
PRP5 RecA domains
3’ domain
SF3A
Refinement
Initial model used (PDB code)
Model resolution (Å)
FSC threshold
Model resolution range (Å)
Map sharpening B factor (Å²)
Model composition
Non-hydrogen atoms
Protein residues
Ligands
B factors (Å²)
Protein
Ligand
R.m.s. deviations
Bond lengths (Å)
Bond angles (°)
Validation
MolProbity score
Clashscore
Poor rotamers (%)
Ramachandran plot
Favored (%)
Allowed (%)
Disallowed (%)


17S U2 5’ domain 
(EMD-10688) 
PDB 6Y50 


120,070 


17S U2 low resolution part 
(EMD-10689) 
PDB 6Y53 


120,700 


120,070 


17S U2 particle 
(EMD-10689) 
PDB 6Y5Q


120,700 


120,070 
Extended Data Table 2 | Summary of modelled proteins and RNA in the human 17S U2 structure


Protein/RNA Chain ID UniProt ID Total residues Modelled residues Template Modelling approach
SF3A1 6 Q15459 793 160-286 4DGW Docked
SF3A2 7 Q15428 464 104-209 4DGW Docked
SF3A3 9 Q12874 501 1-362 4DGW Docked
  392-499 6FF4 Docked and adjusted
SF3B1 u O75533 1304 399-419 2FHO Docked
  463-1304 6EN4 Docked and adjusted
SF3B2 8 Q13435 895 458-598 6FF4 Docked and adjusted
  607-658, 681-693 5LSB Docked
SF3B3 v Q15393 1217 1-1204 6EN4 Docked and adjusted
SF3B4 {e) Q15427 424 11-89, 101-181 5LSB Docked
SF3B5 x Q9BWJ5 86 15-80 6EN4 Docked and adjusted
PHF5A y Q7RTV0 110 6-98 6EN4 Docked and adjusted
SF3B6/p14 z Q9Y3B4 125 17-93 2FHO Docked
U2-A′ a P09661 255 2-163 5MQF Docked
U2-B″ b P08579 225 6-98 5MQF Docked
  149-225 6FF4 Docked
Sm D2 h P62316 118 19-116 5MQF Docked
Sm F i P62306 86 2-75 5MQF Docked
Sm E j P62304 92 14-92 5MQF Docked
Sm G k P62308 76 3-76 5MQF Docked
Sm D3 l P62318 126 2-84 5MQF Docked
Sm B m P14678 240 4-49, 63-87 5MQF Docked
Sm D1 n P62314 119 1-82 5MQF Docked
PRP5 p Q7L014 1031 146-195 Predicted model, Docked
  350-743 4LJY Docked and adjusted
TAT-SF1 q O43719 755 127-220 Predicted model, docked and adjusted
  260-347 2DIT Docked
U2 snRNA 2 NR_002716 188 12-14, 19-21 de novo, Docked
  22-24 de novo
  25-45 Predicted model, Docked
  46-47 de novo
  48-65 5GM6 Docked
  69-73, 81-85 de novo, Docked
  97-184 5MQF Docked


RNA and protein regions were modelled and fit into the electron-microscopy density as indicated. 
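As a reading aid for the table above, the fraction of each chain that was modelled can be worked out from the "Total residues" and modelled-range columns. The sketch below is an illustration only: the helper names `count_modelled` and `coverage_percent` are hypothetical, ranges are assumed to be inclusive, and the two example rows (SF3A1 and Sm B) are taken from the table as printed.

```python
def count_modelled(ranges):
    """Count residues in a range string such as "4-49, 63-87" (inclusive ranges)."""
    total = 0
    for part in ranges.split(","):
        start, end = (int(x) for x in part.strip().split("-"))
        total += end - start + 1
    return total


def coverage_percent(ranges, total_residues):
    """Percentage of a chain's residues that fall inside the modelled ranges."""
    return 100.0 * count_modelled(ranges) / total_residues


# Rows as listed in the table: SF3A1 (793 residues) and Sm B (240 residues).
print(count_modelled("160-286"), round(coverage_percent("160-286", 793), 1))
print(count_modelled("4-49, 63-87"), round(coverage_percent("4-49, 63-87", 240), 1))
```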


Software and code 


Policy information about availability of computer code 


Data collection Thermo Fisher EPU 2.1, Thermo Exactive MS Series ICSW v.2.2
Data analysis MotionCor2, Gctf v1.06, COW-MicrographQualityChecker, Gautomatch v0.56, RELION 3.0, UCSF Chimera v.1.13.1, cryoSPARC v2.1, Coot v.0.8.9.2, SWISS-MODEL suite, SpliProt3D, pLink 1.23, pLink 2.3.5


For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors/reviewers. 
We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Research guidelines for submitting code & software for further information. 


Data 


Policy information about availability of data 
All manuscripts must include a data availability statement. This statement should provide the following information, where applicable: 


- Accession codes, unique identifiers, or web links for publicly available datasets 
- A list of figures that have associated raw data 
- A description of any restrictions on data availability 


The coordinate files have been deposited in the Protein Data Bank as follows: U2 5' domain (6Y50), low resolution region (6Y53) and entire 17S U2 particle (6Y5Q). 
The cryo-EM maps have been deposited in the Electron Microscopy Data Bank as follows: U2 5' domain (EMD-10688) and entire U2 particle (EMD-10689). 
Validation Anti-human SF3B1 rabbit and mouse antibodies were validated in house for use in Western blotting and for immunoaffinity
purification. See Will et al. (2002) EMBO J. 21:4978-88.
Light-driven sodium pumps actively transport small cations across cellular
membranes. These pumps are used by microorganisms to convert light into
membrane potential and have become useful optogenetic tools with applications in
neuroscience. Although the resting state structures of the prototypical sodium pump
Krokinobacter eikastus rhodopsin 2 (KR2) have been solved, it is unclear how
structural alterations over time allow sodium to be translocated against a
concentration gradient. Here, using the Swiss X-ray Free Electron Laser, we have
collected serial crystallographic data at ten pump-probe delays from femtoseconds 
to milliseconds. High-resolution structural snapshots throughout the KR2 photocycle 
show how retinal isomerization is completed on the femtosecond timescale and 
changes the local structure of the binding pocket in the early nanoseconds. 
Subsequent rearrangements and deprotonation of the retinal Schiff base open an 
electrostatic gate in microseconds. Structural and spectroscopic data, in combination 
with quantum chemical calculations, indicate that a sodium ion binds transiently 
close to the retinal within one millisecond. In the last structural intermediate, at 20 
milliseconds after activation, we identified a potential second sodium-binding site 
close to the extracellular exit. These results provide direct molecular insight into the 
dynamics of active cation transport across biological membranes. 


The preservation of sodium gradients across cellular membranes is
crucial for various biological functions. In living cells, the controlled
flow of sodium is maintained by a series of specialized membrane channels
and pumps. For example, glucose uptake in the guts and kidneys of
mammals is fuelled by a sodium gradient, which makes glucose-sodium
symporters important pharmacological targets in the treatment of
diabetes. The opening and closing of voltage-gated sodium channels
is responsible for the generation and propagation of neuronal signals.
This has enabled the field of optogenetics, in which light-sensitive
microbial cation channels from the rhodopsin family are used as a key
component for the manipulation of physiological responses in neurons
or even in living animals by light.

Rhodopsins are a functionally diverse family of proteins in microorganisms
and higher organisms that rely on a retinal chromophore
to harvest and sense light energy. In 2013, the rhodopsin family was
extended by the discovery of light-driven sodium pumps in marine
bacteria, where they maintain a low intracellular sodium ion concentration
and generate a membrane potential. In optogenetic applications,
the controlled light-induced outward pumping of sodium ions leads to
neuronal inhibition under more physiological conditions compared to
the use of related proton or chloride pumps. The optogenetic application
of the prototypical member of the class, Krokinobacter eikastus
rhodopsin 2 (KR2), has been demonstrated in nematodes and cortical
rat neural cells. Genetically engineered variants provide further possibilities
to optimize KR2 for optogenetic applications.

The pumping cycle of KR2 has been studied using several
time-resolved spectroscopic techniques. High-resolution structures
of the resting state have been determined in various forms
by X-ray crystallography. However, in these studies the sodium
substrate is not bound within the retinal binding pocket, indicating
that the pumping mechanism is substantially different from
those in related ion pumps. Additional structural information on the
reaction intermediates is required to understand how sodium can be
actively transported out of the cell against substantial concentration
gradients.

Fig. 1 | Time-resolved absorption measurements on KR2 in solution and
crystals. a, b, Spectra from purified KR2 (a) and in the crystalline phase (b)
prepared in analogy to the TR-SFX experiment. Top, experiments in the
infrared covering the C=C stretch mode of the retinal up to amide vibrations
originating from the protein backbone. Middle, changes in the UV/Vis region
on the same sample. Bottom, a global fit analysis of the infrared data reveals the
presence of intermediate states K,, L/M, O, and O,. c, Model of the photocycle
of KR2 derived from time-resolved absorption spectroscopy. d, Sodium
dependency of the 1,516 cm⁻¹ marker band for the O intermediate in KR2
crystals under TR-SFX conditions (top, amplitude of O intermediate; bottom,
decay of the preceding M intermediate). Time traces were normalized to the
ground state bleach signal and fitted by the sum of three connected
exponentials.

Time-resolved serial femtosecond crystallography (TR-SFX) can
provide structural snapshots of proteins that can be assembled into
molecular movies of protein function. A series of classical targets
has been characterized using this method, including myoglobin,
photoactive yellow protein and bacteriorhodopsin (bR). Here,
we used the Swiss X-ray Free Electron Laser (SwissFEL) to study structural
changes in KR2 within a wide temporal window from 800 fs to
20 ms after activation. Ten structural snapshots positioned at temporal
delays coinciding with the accumulation of intermediates identified
by time-resolved absorption spectroscopy on crystals show
how the energy captured by the retinal leads to structural rearrangements.
Structural, spectroscopic and quantum chemical data indicate
that sodium binds between N112 and D251 and is released via a
second binding site between E11, N106 and E160 on the extracellular
side. Our integrated work thus elucidates the structural changes
associated with active transport of sodium ions across a biological
membrane.


Photocycle in KR2 crystals


As a benchmark for the study of KR2 activation in the crystalline
environment, we used time-resolved (TR) absorption spectroscopy in the
infrared (IR) and ultraviolet/visible (UV/Vis) regions. Because KR2
exhibits an accelerated photocycle under the original acidic
crystallization conditions, we developed a soaking protocol to increase
the pH in the presence of sodium ions (Extended Data Fig. 1). The
treatment changes the colour of crystals from blue to red with
associated changes in the retinal binding pocket. Most importantly in
terms of function, KR2 in treated crystals follows a photocycle (Fig. 1)
identical to that observed by our absorption spectroscopy on purified
KR2, in agreement with previous reports.

Sodium is expected to bind to KR2 after deprotonation of the retinal 
Schiff base (SB) in the M intermediate, followed by release in the late 
O intermediate, because the transition between these spectroscopic 
intermediates is dependent on sodium concentration. At acidic
pH, characteristic O-related bands are absent (Extended Data Fig. 2), 
whereas at a higher pH in the presence of sodium these bands reach 
maximal amplitude a few milliseconds after activation. The TR-IR data 
show that the amplitude and kinetics of the M—O transition depend 
on sodium concentration (Fig. 1d). Clearly, KR2 in treated crystals 
responds to the presence of sodium with a kinetic profile compatible 
with light-driven sodium pumping. Accordingly, all crystals for the 
dynamic measurements described below were rebuffered in the pres- 
ence of sodium before injection across the X-ray laser pulses. 


Structural changes over time 


In three days of beamtime during the first user run of SwissFEL, we
collected 158,832 dark (pump laser off) and 496,904 light (pump laser
on) indexable diffraction patterns with an anisotropic resolution up
to 1.6 Å (Extended Data Table 1, Extended Data Fig. 3a). The light data
were distributed over 10 time delays (Δt = 800 fs, 2 ps, 100 ps, 1 ns,
16 ns, 1 µs, 30 µs, 150 µs, 1 ms and 20 ms) between the optical pump
and the X-ray probe pulses (Extended Data Fig. 4a). These time delays
were selected on the basis of our TR-IR data and of previous ultrafast
stimulated Raman spectroscopic experiments, to cover critical steps in
the KR2 photocycle. The serial crystallographic structure of the KR2
resting state closely resembles structures solved by conventional
cryo-crystallography (Extended Data Fig. 5). Progressing from this
starting point, isomorphous difference electron density maps
(F_o(light) − F_o(dark)) allowed us to follow structural changes over
time. Extrapolated data were used to refine the molecular structures for
each time delay. The light-activated structures follow a continuous
evolution of structural rearrangements that we combined into five
stages (800 fs + 2 ps, 1 ns + 16 ns, 30 µs + 150 µs, 1 ms and 20 ms) on
the basis of root mean square deviations (r.m.s.d.) between the models
(Extended Data Fig. 4b). These five structural intermediates provide
direct molecular insights into the sequence of structural
rearrangements during the KR2 pumping cycle.


Retinal and transmission of light energy 

The first stage in the structural evolution of KR2 activation ranges from
femtoseconds to picoseconds, with structural rearrangements centred at
the retinal chromophore, the principal photochemical switch of all
rhodopsins. Retinal is covalently bound via a protonated SB linkage to
K255 in helix G in the core of KR2. Similar to other microbial
rhodopsins, absorption of a photon in KR2 leads to retinal isomerization
at the C13=C14 bond (Extended Data Fig. 3b). This photochemical process
is faster than in the related proton pump bR, with formation of the
earliest photointermediate after only about 200 fs, which is consistent
with a fully isomerized 13-cis retinal in our earliest difference
electron density at Δt = 800 fs and after structural refinement at
Δt = 800 fs + 2 ps (Fig. 2a). Within the early time delays, we further
observed a shift of the water molecule w406 and the retinal counterion
D116; this was similar to, but less pronounced than, the ultrafast
adaptation of bR. A clear difference from bR is that the isomerized
retinal in KR2 points in the opposite direction, with the C20 methyl
group tilting towards helix C instead of helix G. The direction could be
pre-determined in the resting state because, compared to bR, the retinal
polyene in KR2 bends in the opposite direction.

The second distinct stage in the structural evolution takes place in
the nanosecond range (Fig. 2b), when early conformational changes in
the protein backbone occur. In bR, adaptation of the energetically more
favourable planar 13-cis conformation pushes the straightening retinal
‘upwards’ against W182 in helix F to displace it towards the cytoplasmic
side. In KR2, the changes involving the structurally equivalent W215
are absent; instead, the retinal C20 methyl group pushes ‘sideways’ in
the membrane plane against V117. Starting at Δt = 1 µs and rising until
the later microsecond delay, the difference density maps indicate a flip
of V117 and an established transmission of structural changes into
helix C (Fig. 2c). In this way, the light energy stored in the early
photoproducts propagates into the seven-transmembrane helical bundle to
fuel larger conformational changes at later times.


Sodium translocation and gating 

The third and fourth stages of the structural evolution from
microseconds to milliseconds correlate with the temporal range that is
relevant for sodium translocation. Starting from Δt = 1 µs, clear
electron difference density peaks above 3.5σ show how Y218 in helix
F and S254 in helix G approach the position of retinal in the resting
state. Changes are further transmitted several turns along helix G
towards the intracellular side. Structural refinement resulted in small
shifts of helix C in the order of 1 Å in the Δt = 30 µs + 150 µs
structural intermediate, and additional changes along helix D occur in
the 1-ms and 20-ms delays. These rearrangements (Fig. 3) are of
particular interest because they are close to the putative entry and
exit routes for sodium.

Fig. 2 | Early steps in the activation of a light-driven sodium pump.
a-c, Difference electron density maps ([…] negative density; blue,
positive density; shown at 3.5σ) and […] Δt = 800 fs + 2 ps (a),
Δt = 1 ns + 16 ns (b) and Δt = 30 µs + 150 µs (c). d, e, Superposition
of structures illustrates how retinal isomerization translates the light
energy into structural changes via V117 in helix C in the case of KR2
(d; grey sticks, resting state; blue sticks, structure at Δt = 2 ps;
green sticks, structure at Δt = 1 µs) and via W182 in helix F in the
case of bR (e; data and coordinates previously published; grey sticks,
resting state; blue sticks, structure at Δt = 10 ps; green sticks,
structure at Δt = 0.8 µs). Arrows indicate structural changes discussed
in the main text.

One half of the translocation pathway connects the retinal binding
pocket between helices C, F and G with a water-filled cavity on the
intracellular side. It passes N61 and G263 at the entry side and Q123
of the NQD motif (three residues that are important for ion
selectivity). Native KR2 pumps lithium and sodium ions, but not larger
cations such as potassium. In their hydrated forms, sodium is larger
than potassium, but the dehydrated sodium ion is smaller than potassium.
This suggests that the selective pumping of sodium ions must involve
dehydration, which is likely to happen at the entry to the conducting
pathway.

The narrowest part of the sodium translocation pathway in the
light-activated structures of KR2 passes between the retinal and the
side chains of D116 and K255, which act as a counterion and a covalent
link for the retinal, respectively. Neutralization of the retinal SB
through proton transfer and transient widening at this position is
likely to act as an electrostatic gate (Fig. 4a, Supplementary Video 1)
that allows the selective passage of cations. The distance between the
retinal SB and D116 contracts by about 0.5 Å at Δt = 1 ns + 16 ns, which
favours proton transfer in the transition to the M intermediate. At
Δt = 30 µs + 150 µs, the distance widens again in agreement with the
spectroscopic data, indicating deprotonation of the SB and occurrence of
the M state in the early microseconds (Fig. 1). At 1.4 Å, the opening
seems small for a 1.9 Å sodium ion to pass. At this point we cannot
exclude the possibility that the gate does not fully open in our
crystals (which are formed from monomeric KR2), because pentameric KR2
can adopt a more open conformation in the SB region and mutations in the
oligomerization interface can affect pumping efficiency (for further
discussion, see Extended Data Fig. 5). However, stable structural
intermediates that accumulate in time-resolved studies of molecule
ensembles do not necessarily reveal all functional steps, and our
integrated structural, spectroscopic and computational analysis is
compatible with sodium binding in the later stages (see below). It is
most likely that the approach of a sodium ion and electrostatic
weakening of the helix C-SB interaction allows a transient pathway to
form within the L-M equilibrium in the microseconds.

The fifth stage in the late milliseconds is related to changes in the
section of the translocation pathway that extends from the retinal
binding site towards the extracellular side of the protein. Here the
bottlenecks run along the side chain of R109, consistent with mutations
at this position transforming KR2 into a light-gated inward-facing
potassium channel. A rotamer change of R109 and Q78, together with
a shift of helix D, at Δt = 20 ms indicates an opening that connects the
water-filled cavity in the vicinity of the retinal with a second
water-filled cavity close to E11, N106, E160 and R243, towards the exit
site on the extracellular side of the membrane.

Fig. 3 | Structural changes along the sodium translocation pathway.
a, Overview of the KR2 structure with the suggested route of sodium
translocation across the membrane identified using the program Caver
(blue and red surfaces for the two halves plotted on the structure at
Δt = 1 ms with changes in colour indicating functionally critical
regions). Selected residues and the retinal chromophore are shown as
sticks. b, Structural refinements and comparison to the resting state
show how the protein rearranges from nanoseconds to milliseconds
(blue-to-red gradient and ribbon width indicate r.m.s.d. to the resting
state). c-f, Close-up views into the region of the retinal SB (c, d) and
the extracellular side of the membrane (e, f). The structural
rearrangements are compatible with the formation of a transient sodium
binding site between N112 and D251 at Δt = 1 ms (c) and a second site
further along the translocation pathway between E11, N106 and E160 at
Δt = 20 ms after photoactivation (f). Difference electron density maps
(F_o(light) − F_o(dark); gold, negative density; blue, positive density;
contoured at 3.2σ); arrows highlight structural changes discussed in the
main text.


Formation of sodium binding sites 


Fig. 4 | Electrostatic gating mechanism. a, Circles show the retinal
binding pocket at critical steps in the KR2 photocycle. Blue, positive
charge of the retinal SB; red, negative charge at the D116 counterion;
arrows indicate steps in sodium translocation including light-induced
retinal isomerization and conformational rearrangements in the retinal
binding pocket, entry of a sodium ion after proton transfer from the SB
to the counterion, binding of the sodium ion and reprotonation of the
retinal SB and ion release. b, c, QM/MM-optimized geometry with water (b)
or sodium ion (c) in the binding site between N112 and D251. The
spectral shift is only in line with the spectroscopic data on the
redshifted O intermediate (compare Fig. 1 and Extended Data Fig. 6h)
when sodium is included. The shift is not a direct effect but is due to
the change in the D116-SB interaction through N112 via the shown
hydrogen-bonding network (dashed lines). For a dynamic illustration of
the described rearrangements see Supplementary Video 1.

In the alternate access model of active membrane transport, the
substrate is bound while the protein rearranges to allow release without
backflow. Light-driven sodium pumps such as KR2 do not bind substrate
close to the retinal SB in the resting state. Hence, it remains unclear
where the transported sodium ion is located and how it moves across
the membrane with time. Time-resolved IR spectroscopy on a protein
film (our results and published work) and crystals provides evidence
(through a marker band at 1,688 cm⁻¹) of changes in the environment of
an asparagine residue that peak in the O intermediate within
milliseconds under sodium pumping conditions (Fig. 1, Extended Data
Fig. 2). Mutagenesis of N112 and D251 close to the retinal binding
pocket abolishes sodium pumping, and these two residues have been
suggested to be potential sodium coordination partners, based on
molecular dynamics simulations and structural comparisons to other
rhodopsins (Extended Data Fig. 6). Analysis of our TR-SFX data shows a
notable evolution of the electron density in this region (Fig. 3c, d).
At Δt = 1 ms, a clear positive difference peak is located about 1 Å away
from w406 in the resting state and close to N112 and D251. The electron
densities of water and sodium ions have an identical signature in X-ray
crystallographic data, but the environment can provide clues to the
nature of the detected atoms. At Δt = 1 ms, the density peak is shifted
away from the amine group of W113 and the positively charged R109, both
of which coordinate w406 in the resting state but cannot bind sodium.
The new position is at a distance of 2.5 Å from N112 and D251, which is
close to the ideal coordination distance for a sodium ion, concurring
with our spectroscopic data. We validated the position of the putative
sodium ion using hybrid quantum mechanics/molecular mechanics (QM/MM)
calculations. Inclusion of the sodium in the Δt = 1 ms structure
resulted in a large spectral red shift of 55 nm with respect to the
electronic absorption band of the resting state, whereas placing water
at this position yields an 11-nm blue shift (Fig. 4b, c). The
experimental value obtained from the transient spectroscopic data is a
red shift of 66 nm; hence, inclusion of the sodium ion is necessary to
reproduce the absorption maximum of the O intermediate (λmax = 592 nm).
The absorption shift is not a direct effect, but results from changes in
the D116-SB interaction through N112. It is reasonable to suggest that,
through alteration of this H-bonding network, sodium binding favours
reprotonation of the SB and thereby blocks the backflow of ions.
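The comparison between the QM/MM shifts and the spectroscopic value can be made explicit with a few lines of arithmetic. The sketch below is only an illustration using the numbers quoted above; the variable names are ours, and adding a computed shift to an experimentally implied resting-state maximum is a simplifying assumption rather than part of the published analysis.

```python
# Values quoted in the text, in nm.
LAMBDA_MAX_O = 592.0            # absorption maximum of the O intermediate
EXPERIMENTAL_RED_SHIFT = 66.0   # O intermediate relative to the resting state

# Resting-state maximum implied by the two experimental numbers above.
lambda_rest = LAMBDA_MAX_O - EXPERIMENTAL_RED_SHIFT

# QM/MM shifts relative to the resting state (positive = red shift).
qmmm_shifts = {"sodium": 55.0, "water": -11.0}

# Absolute deviation of each computed shift from the experimental shift.
deviations = {
    ligand: abs(shift - EXPERIMENTAL_RED_SHIFT)
    for ligand, shift in qmmm_shifts.items()
}

for ligand, shift in qmmm_shifts.items():
    predicted = lambda_rest + shift
    print(f"{ligand}: predicted maximum {predicted:.0f} nm, "
          f"deviation from experiment {deviations[ligand]:.0f} nm")
```

With these inputs the sodium model lies 11 nm from the experimental shift, whereas the water model is off by 77 nm and in the wrong direction.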

At Δt = 20 ms, the electron density close to the retinal binding pocket
fades below the 3σ level, indicating release of the sodium ion. Further
along the translocation pathway we observe the formation of a second
sodium binding site close to the extracellular side of the membrane.
Here, a clear positive difference peak appears between E11, N106 and
E160. In the same temporal regime, the shift of R243 moves a positive
charge away to facilitate sodium binding (Fig. 3e, f). Again, the
coordination distances of 2.4 Å to N106 and 2.5 Å to E11 support our
assignment of a sodium ion in this putative binding site. Notably, both
sodium binding sites use displacements of arginine residues to favour
binding of sodium over water. The corresponding positions of these
residues have functional equivalents in bR (Extended Data Fig. 6), with
R109 close to the retinal binding pocket corresponding to R82, which is
critical for proton transfer in bR. The position of R243 in KR2 is
occupied by E194 in bR, which is part of the proton release group. Some
key sites in the seven-transmembrane helical bundle seem to be
functionally conserved throughout evolution but are approached at
different times within their respective pumping cycles.


Conclusions 


Our data have allowed us to assemble a molecular movie of structural 
changes in KR2 and to propose a basic mechanistic model of light-driven 
sodium transport (Supplementary Video 1). The unidirectional flow 
of ions is achieved by minimal structural changes that generate ion 
selectivity and prevent ion back leakage into the cell. Our observa- 
tion of active ion pumping is consistent with general concepts of ion 
pumping across a biological membrane by the alternate access model, 
illustrating them with high-resolution structures of the intermediate 
steps. Future studies will investigate how pH and long-range coopera- 
tive effects between protomers influence these structural dynamics. 
X-ray lasers now provide the means to study how single point mutations
allow the translocation of larger ions such as potassium and caesium or
turn KR2 from an active pump into a passive channel. Deeper insights
into the transport mechanisms found in microbial rhodopsins will
demonstrate how evolution has adapted a common leitmotif to different
functions and facilitate the design of variants for neurobiological
applications in optogenetics.
Methods


Cloning, protein expression and purification 

The KR2 construct with a TEV-cleavable C-terminal 6xHis-tag was
cloned into the pStaby1.2 vector (Delphi Genetics). Protein expression
was performed in C41(DE3) Escherichia coli cells. The cells were grown
in smooth shaking flasks with Luria broth at 37 °C. Expression was
induced by addition of 1 mM isopropyl β-D-thiogalactopyranoside (IPTG)
at an optical density at 600 nm (OD600) of ~0.8. Following overnight
expression at 37 °C in the presence of 10 µM all-trans retinal, the
bacterial cultures were harvested by centrifugation at 5,000g for
15 min. The cell pellets were disrupted with an Avestin EmulsiFlex-C3
homogenizer at 15,000 psi in lysis buffer (20 mM Tris pH 8.0, 5%
glycerol, 0.5% Triton X-100, 5 µg/ml DNase I and cOmplete protease
inhibitor tablets, Roche), and the membrane fraction was collected by
ultracentrifugation at 90,000g. The membrane pellet was resuspended with
an IKA T 25 Ultra-Turrax disperser in solubilization buffer that
contained 50 mM Tris pH 8.0, 300 mM NaCl, cOmplete protease inhibitors,
1.0% n-dodecyl β-D-maltoside (DDM, Anatrace), and 0.2% cholesteryl
hemisuccinate (CHS, Anatrace), and stirred overnight at 4 °C. The
overnight suspension was subjected to a second round of
ultracentrifugation before the supernatant was applied to immobilized
metal affinity chromatography (IMAC), and further washed with IMAC
buffer (50 mM Tris pH 8.0, 150 mM NaCl, 100 mM imidazole, 0.02% DDM,
0.04% CHS). The bound protein was eluted by addition of 500 mM imidazole
in IMAC buffer. TEV protease was added, and the KR2-TEV cleavage
solution was sealed in an 8 kDa molecular weight cutoff dialysis
membrane and dialysed against 50 mM Tris pH 8.0, 150 mM NaCl, 0.02% DDM,
0.04% CHS buffer overnight. The TEV-cleaved solution was re-applied to
the IMAC column, and the flow-through was collected and concentrated
with a centrifugal filter device (Millipore, 100 kDa molecular weight
cutoff). The concentrated protein sample was loaded onto a HiLoad
Superdex 75 prep grade 16/600 column (GE Healthcare) equilibrated with
SEC buffer (50 mM Tris pH 8.0, 150 mM NaCl, 0.05% DDM, 0.01% CHS). The
elution profile was monitored at 280 nm and 530 nm with a Shimadzu
UV-2401PC spectrophotometer, and the purest fractions were concentrated
to ~100 mg/ml, flash frozen in liquid nitrogen and stored at −80 °C
until crystallization.


Crystallization and TR-SFX sample preparation 

Crystallization was carried out in lipidic cubic phase (LCP) using
conditions similar to those described. The purified protein buffer and
monoolein (1-oleoyl-rac-glycerol, Nu-Chek prep) were thoroughly
mixed in a 4:7 v/v ratio through coupled gas-tight Hamilton syringes.
The formed LCP was extruded through Hamilton needles into plastic
B-Braun Omnifix-F syringes loaded with precipitant (200 mM sodium
acetate pH 4.4, 150 mM MgCl2, 35% PEG 200). Crystallization occurred
overnight in the dark at 20 °C and yielded plate-like blue KR2 crystals
with dimensions of 10-30 × 10-25 × 1-3 µm³ (for a size distribution,
see Extended Data Fig. 1c).

Following formation of crystals, the precipitant solution was washed
out by soaking the LCP in excess 150 mM NaCl, 35% PEG 200 solution,
two times for 48 h in total. The washed phase, with light-blue
unbuffered crystals, was then harvested into gas-tight Hamilton syringes
in 60 µl fractions, and doped with 33 µl monoolein and 3.0 µl 50% PEG
1500 to form a stable jetting phase. Before data collection, the phase
with crystals was mixed with LCP prepared from monoolein and 1 M
Tris pH 9.0, 150 mM NaCl, 35% PEG 200 through a three-way syringe
coupler. The volumes of the mixed phases were chosen such that the
water fraction of the mesophase would contain 200 mM Tris, 150 mM
NaCl, 35% PEG 200 and PEG 1500; the blue KR2 crystals changed
colour to red upon mixing (Extended Data Fig. 1). Although the mixing
was done with Tris solution at pH 9.0, the final pH of the preparation
was close to pH 8 as confirmed by litmus paper. We attribute this shift
by one pH unit and the shift in colour transition compared to KR2
solution to possible residual buffer trapped in LCP and/or the buffering
capacity of monoolein molecules. The jetting stability of the mixed
phase was confirmed before the XFEL experiment with a high-speed camera
setup as described (Supplementary Video 2).


Experimental setup and XFEL data collection 

The TR-SFX data on KR2 crystals were collected in February 2019 in
three days at the Alvra experimental station of SwissFEL. X-ray pulses
with a photon energy of 12 keV and a pulse energy of 400 µJ at a
repetition rate of 50 Hz were used for the experiment. On average 180 µJ
(9 × 10¹⁰ X-ray photons) per pulse were deposited onto a 2.6 × 3.6 µm²
spot (FWHM), focused by Kirkpatrick-Baez mirrors. The X-ray intensity
was adjusted using solid attenuators to maximize diffraction signal
without disrupting the sample injector flow or damaging the detector.
To reduce X-ray scattering, the air in the sample chamber was pumped
down to 100-200 mbar while being substituted with helium. To reduce
the amount of data, the Jungfrau 16M detector was run in 4M mode
excluding the outer panels.
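The photon number quoted for the deposited pulse energy follows directly from the photon energy. A minimal sketch of that conversion is given here; the function name is hypothetical and the only inputs are the 180 µJ and 12 keV stated above.

```python
# Exact SI value of one electronvolt in joules.
ELECTRON_VOLT_J = 1.602176634e-19


def photons_per_pulse(pulse_energy_j, photon_energy_ev):
    """Number of photons carrying pulse_energy_j at the given photon energy."""
    return pulse_energy_j / (photon_energy_ev * ELECTRON_VOLT_J)


# 180 µJ deposited per pulse at a photon energy of 12 keV.
n_photons = photons_per_pulse(180e-6, 12e3)
print(f"{n_photons:.1e} photons per pulse")
```

The result is on the order of 9 × 10¹⁰ photons per pulse, in line with the value given in the text.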

KR2 crystals were loaded into a high viscosity injector connected
to an HPLC pump. The crystals were extruded into the pump-probe
interaction point through a 75-µm capillary at a flow rate of 3.35 µl
per minute. In the interaction point, the probe XFEL beam intersected
with a circularly polarized pump beam originating from an optical
parametric amplifier producing laser pulses with 150 fs duration (1/e²),
575 nm wavelength and 3 µJ total energy in a focal spot of 80 × 80 µm²
(1/e²), corresponding to a maximal laser fluence of 59 mJ/cm²
and laser power density of 397 GW/cm². Approximating the dose per
KR2 molecule with the Lambert-Beer law for an average 19 × 16 × 2 µm³
sized crystal, we estimated that 8.3-3.5 photons per retinal are
absorbed, depending on the crystal’s orientation and the position of
the individual chromophore within the optically dense crystal. However,
the average photon dose is certainly lower because scattering
and reflection on the extruded material further reduce the dose, with
estimates ranging from 20% to 90%. Our previous best estimate was
80% for TR-SFX experiments on bR, which would reduce the calculated
doses to 1.7-0.7 photons per retinal. Another point to consider is that
the Sn←S1 excited-state absorption is in the 400-500 nm region. As
such, the excited state is unlikely to absorb a 575-nm photon, further
minimizing the chance of multi-photon absorption.
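The fluence, power density and loss-corrected dose quoted above can be checked with simple arithmetic. The sketch below assumes, for illustration only, that the 3 µJ pulse energy is spread evenly over a circular spot of 80 µm diameter; this assumption reproduces the quoted values to within rounding but is not necessarily how they were derived. All function names are hypothetical.

```python
import math


def average_fluence_mj_cm2(pulse_energy_j, spot_diameter_um):
    """Pulse energy divided by the area of a circular spot, in mJ/cm^2."""
    radius_cm = spot_diameter_um / 2 * 1e-4   # micrometres to centimetres
    area_cm2 = math.pi * radius_cm ** 2
    return pulse_energy_j / area_cm2 * 1e3    # J/cm^2 to mJ/cm^2


def power_density_gw_cm2(fluence_mj_cm2, pulse_duration_s):
    """Fluence divided by pulse duration, in GW/cm^2."""
    return fluence_mj_cm2 * 1e-3 / pulse_duration_s / 1e9


def dose_after_losses(photons_per_retinal, loss_fraction):
    """Photon dose remaining after scattering and reflection losses."""
    return photons_per_retinal * (1.0 - loss_fraction)


fluence = average_fluence_mj_cm2(3e-6, 80.0)       # about 59.7 mJ/cm^2
power = power_density_gw_cm2(fluence, 150e-15)     # about 398 GW/cm^2
# Apply the 80% loss estimate to the upper and lower dose bounds.
dose_range = [round(dose_after_losses(d, 0.80), 1) for d in (8.3, 3.5)]
print(fluence, power, dose_range)
```

Under this assumption the 8.3-3.5 photons per retinal reduce to about 1.7-0.7 once the 80% loss estimate is applied.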

To cover the KR2 photocycle, time delays between the pump laser and the
probing XFEL pulses were chosen at Δt = 800 fs, 2 ps, 100 ps, 1 ns,
16 ns, 1 µs, 30 µs, 150 µs and 1 ms. An additional time delay at 20 ms
was created by shifting the laser focus position and using the 50 Hz
XFEL repetition rate to create a delay to the pump laser.

Every fifth pulse of the pump laser was blocked, so that a series of
four light-activated and one dark diffraction pattern were collected in
sequence. Roughly 50,000 light-activated patterns were collected for
each time delay. A high-quality dark data set was obtained by merging
the patterns from the fifth pulses with ~50,000 patterns collected with
the pump laser off. These laser-off images were also acquired for
comparison, to confirm that the dark data in each cycle were not
illuminated. Finally, about 40,000 patterns were obtained with no laser
activation for untreated crystals at acidic pH, to compare with crystals
soaked as described above. For clarity, the applied data collection
scheme is illustrated in Extended Data Fig. 4.
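The interleaved collection scheme can be sketched as a simple labelling of consecutive X-ray pulses. This is an illustration only: the function and constant names are hypothetical, and the actual timing and data-sorting logic of the beamline is not described in the text.

```python
REPETITION_RATE_HZ = 50

# Spacing between consecutive XFEL pulses; at 50 Hz this equals the
# 20 ms used for the longest pump-probe delay.
pulse_period_ms = 1000 / REPETITION_RATE_HZ


def pump_states(n_pulses, block_every=5):
    """Label each X-ray pulse 'light' or 'dark'.

    The pump laser is blocked on every block_every-th pulse, so each cycle
    contains four light-activated patterns followed by one dark pattern.
    """
    return [
        "dark" if (i + 1) % block_every == 0 else "light"
        for i in range(n_pulses)
    ]


sequence = pump_states(10)
print(pulse_period_ms, sequence)
```

Sorting patterns by such a label is what allows light and dark data to be merged separately while being collected under otherwise identical conditions.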


Data processing 

All data were indexed, integrated and merged using CrystFEL version
0.8.0. Data were indexed using the xgandalf and taketwo algorithms.
Data were integrated using the --rings option in indexamajig. Patterns
were merged using partialator with the following options:
--model=unity, --iterations=3. No per-pattern resolution cutoff was
applied. Data showed diffraction anisotropy of about 2.2 × 2.2 × 1.6 Å
along a*, b* and c*. The general resolution cutoff of 1.6 Å was chosen
after evaluating all uncut data sets using the staraniso server, showing
a maximal resolution of 1.6 Å for most data sets (Extended Data
Table 1). Diffraction intensities of each individual data set output
from partialator were cut at 1.6 Å resolution and subjected to the
staraniso server again. The staraniso server determines the resolution
cutoff for each data set based on the signal-to-noise ratio in a given
resolution shell and then truncates data along these directions; in the
case of KR2 this resulted in data truncation very close to the given
ellipsoid dimensions. It is currently not possible to generate a merged
data set from serial crystallographic data using staraniso, so we
generated data sets containing only reflections that were kept by
staraniso and generated statistics from them using CrystFEL. Refinements
for dark data collected at acidic and neutral pH were carried out to the
full resolution range against data obtained after anisotropic
truncation. Since data extrapolation lowers the data quality, we lowered
the resolution cutoff for structural refinements of the combined data
(800 fs + 2 ps, 1 ns + 16 ns and 30 µs + 150 µs) to 2.25 Å resolution
and of the smaller data sets at 1 ms and 20 ms to 2.5 Å. The truncated
data sets were deposited to the worldwide Protein Data Bank (wwPDB)
together with the structures (Extended Data Table 2) refined against
original (resting acidic pH and resting neutral pH) and extrapolated
data (800 fs + 2 ps, 1 ns + 16 ns, 30 µs + 150 µs, 1 ms and 20 ms).


Calculation of difference density maps 

All F_o(light) − F_o(dark) difference maps were calculated in PHENIX
using the multi-scaling option, excluding amplitudes smaller than 3σ
and resolutions lower than 10 Å, with the anisotropy-corrected data and
phases of the refined neutral resting state.


Structure determination and refinement of KR2 resting state 

The structure of the KR2 neutral resting state was solved using
molecular replacement with PDB accession 3X3C as search model. The
structure of the KR2 acidic resting state was refined directly using the
neutral model as a starting point. Structural refinements were done
using PHENIX with iterative cycles of manual adjustments made in Coot.


Data extrapolation 

Extrapolated data were calculated using the anisotropy-corrected data
and a linear approximation as follows: F_extra = 100/A × (F_o(light) −
F_o(dark)) + F_o(dark), where A is the activation level in per cent and
F_extra is the extrapolated structure factor amplitude. F_o(light) was
scaled to F_o(dark) before calculation of F_extra. The activation level
A (the percentage of molecules that neither stayed in nor returned to
the dark state after the laser pulse) was determined by calculating
extrapolated maps with phases of the dark state and light data at
different activation levels in steps of 2% in the calculation of
F_extra until features of the dark state appeared at the retinal. On the
basis of this analysis, we chose an activation level of 14% for
extrapolated maps.
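The linear extrapolation can be illustrated with a short sketch operating on lists of already-scaled amplitudes. This is not the processing pipeline used for the deposited data; the function names and the toy amplitudes are assumptions made for the example.

```python
def extrapolate_amplitudes(f_light, f_dark, activation_percent):
    """Linear extrapolation: F_extra = 100/A * (F_light - F_dark) + F_dark.

    f_light and f_dark are matching lists of structure factor amplitudes,
    with f_light already scaled to f_dark. activation_percent is A.
    """
    if not 0 < activation_percent <= 100:
        raise ValueError("activation level must be in (0, 100] per cent")
    scale = 100.0 / activation_percent
    return [scale * (fl - fd) + fd for fl, fd in zip(f_light, f_dark)]


def drop_negative(amplitudes):
    """Discard negative amplitudes, which can arise for weak reflections."""
    return [f for f in amplitudes if f >= 0]


# Toy amplitudes (arbitrary units), extrapolated at A = 14%.
f_dark = [100.0, 50.0, 2.0]
f_light = [101.4, 49.3, 1.5]
f_extra = extrapolate_amplitudes(f_light, f_dark, activation_percent=14)
usable = drop_negative(f_extra)
print(f_extra, usable)
```

Because the difference is multiplied by 100/A (about 7 at 14% activation), small changes and their noise are strongly amplified, and weak reflections can become negative, as the third toy reflection shows.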


Refinement of time-resolved states 

Negative amplitudes resulting from the extrapolation procedure were
removed from the TR-SFX data before model building and refinement.
The models were manually adjusted to best fit observed difference
map features as well as extrapolated maps. For residues with multiple
conformations present in the resting state, the prevalent conformation
was chosen based on electron density maps, followed by refinement in
PHENIX. Initial models were refined against extrapolated data from
all ten different time delays. Using a pairwise comparison at all time
delays (Extended Data Fig. 4b) of structures based on their r.m.s.d.,
in combination with manual inspection of electron density maps and
input from the time-resolved spectroscopic data (Fig. 1), we identified
structural transition points and combined delays at 800 fs with
2 ps, 1 ns with 16 ns and 30 µs with 150 µs to further improve density
maps, refinement statistics and model quality for the first three
deposited intermediates (Extended Data Table 2). In the final refinement
of the two delays in the millisecond regime, restraints on sodium
distance were used. However, this only marginally affected results, as
the sodium atom close to the retinal binding pocket refined to a
distance of 2.4 Å to N112 and 2.7 Å to D251 with an average near
identical to the 2.5 Å from restrained refinement. A similar result was
obtained within the second binding pocket where the sodium atom refined
to a distance of 2.4 Å to N106 and 2.5 Å to E11 with restraints and
2.4 Å to both residues without them.


Time-resolved spectroscopy 

Crystals for spectroscopic characterization were prepared in a similar
fashion to those used in the time-resolved SwissFEL experiments. A
few microlitres of crystals immersed in LCP (lipidic cubic phase) were
sandwiched between two BaF2 windows and sealed with vacuum paste
immediately after extrusion to prevent drying. Similar to our previous
studies, KR2 solubilized in 0.02% DDM, 0.004% CHS with 100
mM Tris pH 8.0, 150 mM NaCl, was dried from a concentrated protein
solution on a BaF2 window. The dried film was rehydrated via the vapour
phase generated by a glycerol:water mixture in a 3:2 ratio and sealed
with a second window using vacuum grease.

Time-resolved IR experiments were recorded on a home-built
spectrometer based on tunable Quantum Cascade Lasers (QCLs) as
described. We traced transient absorption changes in the frequency
range of 1,510–1,690 cm⁻¹ in steps of 2 cm⁻¹ for crystals and 1 cm⁻¹ for
protein films across the time range of 5 ns to 200 ms. The repetition
rate was set to 2 Hz and each kinetic was averaged 25 times. After reaching
1,690 cm⁻¹, the scanning direction was reversed and the two data
sets were merged, accounting for possible protein bleaching. The
frequency-doubled emission of a Q-switched Nd:YAG (neodymium
yttrium aluminium garnet, Minilite; Amplitude) laser emitting at 532 nm
was used for photoactivation with an energy density set to ~3 mJ/cm².

Absorption changes in the visible range were recorded using a commercial
flash photolysis setup (LKS70; Applied Photophysics) essentially
as described. The photoreaction was induced by a short laser
pulse emitted by a Nd:YAG laser (Quanta-Ray; Spectra-Physics), which
drives an optical parametric oscillator tuned to 532 nm with an energy
density of ~3 mJ/cm². Transients were recorded from 380 to 650 nm in
10-nm steps, omitting the wavelengths around the exciting laser pulse
owing to light scattering (that is, 510–550 nm). Each trace was recorded
10 times with a repetition rate of 2 Hz and subsequently averaged.

Time-resolved step-scan and rapid-scan FTIR experiments on KR2 
films were conducted using a Vertex 80v spectrometer (Bruker Optics). 
The excitation laser source was the same as for the QCL measurements 
with a repetition rate of 2 Hz for the step-scan mode. In the case of 
rapid-scan experiments, a slower repetition rate of 0.018 Hz was used 
to take account of the slower photocycle of KR2 in the presence of KCl.

We reconstructed the recorded data by applying singular value 
decomposition. Kinetic analysis was done by fitting the data to a model 
consisting of a unidirectional sequence of states. This yields a concentration
profile of the involved and spectroscopically observable states
over the course of our measurement.
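
The two steps described above, reconstruction by singular value decomposition and a unidirectional sequential kinetic model, can be sketched as follows. This is an illustration under stated assumptions, not the analysis code used for the data in this work: the function names are hypothetical, every step is taken to be first order, the last state decays back to the (unmodelled) ground state, and all time constants are assumed to be distinct so that the rate matrix can be diagonalized.

```python
import numpy as np


def svd_reconstruct(data, rank):
    """Rebuild a (frequency x time) matrix from its `rank` largest
    singular components, discarding the remaining ones as noise."""
    u, s, vt = np.linalg.svd(data, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank, :]


def sequential_concentrations(times, time_constants):
    """Concentration profiles for S1 -> S2 -> ... -> Sn -> ground state.

    `time_constants[i]` is the decay time of state i (hypothetical values
    in any consistent time unit). Returns an array of shape
    (len(times), n_states), with all population starting in S1.
    """
    k = 1.0 / np.asarray(time_constants, dtype=float)
    n = len(k)
    rate = np.diag(-k)                                   # decay of each state
    rate[np.arange(1, n), np.arange(n - 1)] = k[:-1]     # S_i feeds S_(i+1)
    eigvals, eigvecs = np.linalg.eig(rate)
    c0 = np.zeros(n)
    c0[0] = 1.0
    coeff = np.linalg.solve(eigvecs, c0)
    # c(t) = V exp(L t) V^-1 c0, evaluated for all times at once
    profile = (eigvecs * coeff) @ np.exp(np.outer(eigvals, times))
    return np.real(profile).T
```

In a global fit, the time constants would be varied until the product of these concentration profiles and a set of state spectra reproduces the reconstructed data; that optimization step is omitted here.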


Hybrid QM/MM calculations 

Crystallographic coordinates from TR-SFX were used as the initial input
for the hybrid QM/MM calculations. A subsystem of the protein was
chosen and treated using a quantum chemical method (QM region),
while the remaining part was treated using a classical force field (MM
region), namely AMBER ff14SB. The QM region includes the retinal
chromophore, the side chain of K255, and side chains of residues D116,
N112, and D251. The link atom was placed at the QM/MM boundary
between the Cα and Cβ atoms of the amino acids. To test the effect of
sodium binding we included either a sodium ion or a water in the QM
region of the 1-ms structure.

The geometry optimization was performed at the B3LYP/cc-pVDZ/
AMBER level of theory. In these simulations, the protein backbone
position was fixed to the crystallographic structure, whereas the QM
region and side chains of amino acids within the 5 Å region of the retinal
protonated Schiff base in the MM region were relaxed. Resolution of
identity for Coulomb integrals (RI-J) and the chain-of-spheres approximation
for the exchange integrals were applied (RIJCOSX). Correction for
dispersion was included with the D3/B-J damping variant. The calculation
of vertical excitation energies used the simplified TD-DFT method
developed by Grimme and co-workers at the CAM-B3LYP/cc-pVTZ
level of theory. All the QM/MM calculations were performed with the
quantum chemistry package Orca interfaced with the DL_POLY module
of ChemShell for the force field.


Reporting summary 


Further information on research design is available in the Nature 
Research Reporting Summary linked to this paper. 


Data availability 


Resting state coordinates and structure factors have been deposited
in the PDB database under accession code 6TK7 (acidic pH) and 6TK6
(neutral pH). Together with the neutral pH resting state structure, the
light-activated data for all time points (800 fs, 2 ps, 100 ps, 1 ns, 16 ns,
1 µs, 30 µs, 150 µs, 1 ms and 20 ms) were deposited in the mmCIF file. For
the refined structures using combined data (800 fs + 2 ps, 1 ns + 16 ns
and 30 µs + 150 µs) and single (1 ms and 20 ms) light-activated datasets,
coordinates, light amplitudes, dark amplitudes and extrapolated structure
factors have been deposited in the PDB database under accession
codes 6TK5 (800 fs + 2 ps), 6TK4 (1 ns + 16 ns), 6TK3 (30 µs + 150 µs),
6TK2 (1 ms) and 6TK1 (20 ms).
Extended Data Fig. 1 | TR-SFX sample preparation scheme and
characterization. a, Two lipidic phases, the first containing KR2 crystals from
which the acidic buffer had been washed away and the second containing
soaking buffer, were mixed through a three-way coupler. The sodium
concentration was adjusted to 150 mM and tests with litmus paper indicate a
pH close to 8 in the final mixture. b, Left, crystals after TR-SFX sample
preparation with varying Tris buffers with pH values from 7.0 to 9.0. Right, KR2
in 200 mM Tris, 150 mM NaCl, 0.02% DDM, 0.004% CHS solution with varied pH
of Tris buffer. Note that crystals grown in LCP reach a red colour after mixing
with Tris pH 9.0, whereas the solution reaches red colour at pH 8.0. This shift is
probably due to residual buffering capacity from the original crystallization
conditions, as confirmed by a litmus paper test. c, Size distribution of crystals
determined by microscopy inspection after TR-SFX sample preparation.
d, Microscopy picture of KR2 crystals grown at acidic pH and in the absence of
NaCl; after washing out the crystallization buffer; the TR-SFX sample prepared
via the procedure shown in a; and LCP with KR2 crystals soaked directly as
control. The colour change upon increasing pH was confirmed in five
independent experiments. e, Overview of TR-IR traces and visible absorption
spectra obtained from blue KR2 crystals in the original acidic crystallization
condition, red crystals prepared in analogy to the TR-SFX experiment, and in
hydrated film at pH 8 in the presence of sodium chloride (solution). The
corresponding lower panels show a kinetic analysis of KR2 intermediates
obtained by global fit analysis of the spectroscopic data. Time constants are
given in parentheses. Under acidic conditions, KR2 in crystals exhibits an
accelerated photocycle. In treated crystals, the critical deprotonation step in
the M intermediate occurs with similar kinetics as in purified KR2. f, g, Detailed
view of the retinal binding pocket in the serial crystallographic room
temperature structures of KR2 obtained from blue crystals at acidic pH (f) and
red crystals after soaking in neutral pH and NaCl (g). Critical hydrogen bonds
with a distance of <3.2 Å are indicated by black dotted lines. Arrows signify the
distance from the SB to w406 and the D116 counterion. In neutral conditions,
w406 has shifted away from the SB while D116 and N112 are now within
hydrogen bonding distance. The resulting change in electrostatic environment
is responsible for the colour change, as reported previously.
Extended Data Fig. 2 | Spectroscopic analysis of sodium binding mode.
a, b, Time-resolved IR absorption changes from KR2 microcrystals at pH 8 (a)
and at pH 4 (b) recorded with tunable quantum cascade lasers as described.
The maximum concentration of the O intermediate is reached at around
1–20 ms after pulsed excitation at pH 8 and is characterized by specific marker
bands that are absent at pH 4, in particular the C=C stretching vibrational band
of retinal at 1,516 cm⁻¹ of the O state. The band at 1,688 cm⁻¹ has previously been
suggested to originate from the C=O stretching mode as a result of sodium
binding to an asparagine residue, presumably N112. The band at 1,554 cm⁻¹ is
tentatively assigned to the asymmetric stretching vibration of a carboxylate
that rises upon binding of a sodium ion in bidentate or pseudo-bridging
fashion where one of the carboxylate oxygens is interacting with another
partner. For a detailed analysis of the ligation, the corresponding
symmetrical mode needs to be assigned, as the frequency difference between
the COO⁻ asymmetric and symmetric stretch is dependent on the mode of
sodium ligation. c, O(like)-KR2 (ground state) difference spectra recorded
under different conditions. Spectra have been scaled to the ground-state
bleaching band measured between 1,530 and 1,540 cm⁻¹. It is well established
that KR2 operates exclusively as a sodium pump at neutral pH and in the
presence of sodium ions. KR2 acts as a proton pump in the presence of KCl but
has no (known) function at pH 4. It is evident that the band at 1,688 cm⁻¹ is
most pronounced if sodium is pumped, which supports the assignment to the
C=O stretching vibration of N112 upon binding of sodium. d, Time traces of the
ethylenic stretch of the O state vibrating at 1,518 cm⁻¹ and the two candidate
vibrational bands at 1,420 cm⁻¹ and 1,392 cm⁻¹ of the symmetric carboxylate
stretching vibration. The band at 1,420 cm⁻¹ exhibits similar kinetic behaviour
to the one at 1,518 cm⁻¹. Hence, the former vibrational band is tentatively
assigned as the symmetrical mode that relates to the asymmetrical vibration at
1,554 cm⁻¹ of deprotonated D251 upon ligation of a sodium ion. At 135 cm⁻¹, the
difference in frequency between the symmetric and asymmetric modes is at
the edge of binding in bidentate to pseudo-bridging mode of model
compounds. This indicates that the two oxygens of D251 are not
equidistant from the sodium ion. Such asymmetric ligation is expected in the
heterogeneous environment of the protein interior as documented by our X-ray
structures in the millisecond regime.
Extended Data Fig. 3 | Comparison of data truncation schemes and changes
in retinal over time. a, Top, Fo − Fc simulated annealing omit maps of the resting
state at 4σ. No truncation (left) shows the map using all data up to 1.6 Å
resolution along a*, b* and c*. Spherical truncation (middle) shows the map
using all data up to 2.2 Å resolution along a*, b* and c*. Anisotropic truncation
(right) shows the map using data up to 2.3 Å, 2.2 Å and 1.6 Å resolution along a*,
b* and c*, respectively, as truncated by the staraniso server. Bottom,
Fo(light) − Fo(dark) difference density maps of the 1-ms time delay at the region
around retinal and V117 at 3σ. The structure is shown as sticks (salmon, resting
state; cyan, 1 ms refined structure). No truncation (left) shows the map using all
data up to 1.6 Å resolution. Spherical truncation (middle) shows the map using
all data up to 2.2 Å resolution. Anisotropic truncation (right) shows the map
using data up to 1.6 Å resolution in c* as truncated by the staraniso server.
Overall, the truncated data result in better electron density maps (both for
2Fo − Fc maps and Fo(light) − Fo(dark) difference maps), with finer features being
resolved. This effect is probably because noise along the missing directions is
removed when compared to no truncation, while retaining the high-resolution
data along c* when compared to spherical truncation. b, The evolution of
electron density around the retinal chromophore over time. Retinal and
K255 of the refined structures are shown as sticks and the electron density is
displayed around them (blue mesh, 2σ). Top, original 2Fo − Fc electron density
map; panels below show extrapolated 2Fextrap − Fc maps. The extrapolated maps
allow us to follow retinal isomerization in detail. In the dark state (top), the
middle section of the retinal polyene chain is slightly bent downwards. In the
picoseconds range, the isomerization is completed and the polyene appears to
be straightened. In our ultrafast data, we did not observe retinal with a
pronounced twist in the C13=C14 bond as in bR, with retinal in KR2 reaching a
near planar 13-cis conformation much earlier along the activation pathway. In
the time delays from nanoseconds to milliseconds, the electron density reveals
a bend in the retinal molecule resulting from two planes that are twisted
against each other. While the exact dihedral angles cannot be refined
realistically on the basis of the extrapolated data, the bend seems to originate
from the C9=C10-C11=C12 dihedral angle as suggested for the L, M and O
intermediates based on time-resolved FTIR and resonance Raman
spectroscopy. After 20 ms, a definite conclusion concerning the retinal
isomer is difficult. The extrapolated maps suggest that a fraction of the retinal
molecules may have already re-isomerized to the all-trans conformation, while
it is still bent sideways.
Extended Data Fig. 4 | Experimental setup and evolution of structural
changes over time. a, KR2 crystals were extruded in a stream of LCP from a
high-viscosity injector. At the interaction point, the crystals were pumped with
575-nm femtosecond laser pulses before probing for structural changes with
near parallel 12-keV XFEL pulses arriving from SwissFEL with a specific time
delay Δt. The diffraction patterns were collected with a Jungfrau 16M detector
in a series of four light-activated patterns and one dark pattern, for which the
visible pump laser was blocked. To reduce background through diffuse
scattering of XFEL radiation in air, the experimental chamber was pumped
down to 100–200 mbar while the residual air was replaced by helium. b, The
evolution of structural changes in KR2 over time can be followed in a matrix of
r.m.s.d. between all protein and retinal atoms (total of 2,703) in individual KR2
structural snapshots. The numerical r.m.s.d. values have been determined
using the program PyMOL and are coloured in a blue-white-red gradient. The
black boxes highlight time delays at which we combined data based on manual
inspection of electron density maps and the evolution of photo intermediates
in KR2 crystals determined by TR-IR spectroscopy. The approach was inspired
by a recently published tool for visualizing protein motions in time-resolved
crystallography but relies on refined atom positions instead of electron
density changes. c, Difference electron density maps (Fo(light) − Fo(dark),
negative density in gold and positive density in blue, contoured at 4σ and
shown together with the KR2 resting state) obtained at the indicated time
delays. The first panel is included as a control and shows difference electron
density obtained from 50,000 patterns collected with laser off and 100,000
images from dark patterns obtained using the four light/one dark cycle used
during TR-SFX data collection. For orientation, important residues discussed
in the main text are shown as sticks.
Extended Data Fig. 5 | Comparison of monomeric and pentameric KR2
structures. Overall, the structures at neutral pH reported in this work (left), in
Kato et al. (middle) and in Kovalev et al. (right) are very similar (r.m.s.d. of Cα
atoms of 0.30 and 0.72 Å, respectively) with the positions of residues and
hydrogen bonding pattern (lower insets, black dotted lines, defined as
distance <3.2 Å) in the binding pocket well preserved. Depending on
conditions, the pocket in the pentameric structure can adopt the extended
conformation shown. Here N112 is rotated out into the interaction interface
between two KR2 protomers (coloured red) and the space is filled with water
molecules. As the pentameric resting state can adopt a more open
conformation in the SB region, we cannot exclude the possibility that the
electrostatic gate does not fully open in our crystals formed from monomeric
KR2 (see main text). However, an unpublished steady-state structure of
pentameric KR2 under continuous illumination (PDB 6XYT, Gordeliy group,
IBS Grenoble) shows N112 rotated back into the binding pocket in a
conformation very similar to our time-resolved millisecond states. As the
retinal is modelled in the trans configuration, the 6XYT structure may
represent a later intermediate compared to what we observe. The opening
along the retinal (calculated by the program Caver) in our 30 µs + 150 µs
structure is 1.4 Å, which is very close to the 1.9 Å needed for a sodium ion to
pass. Stable structural intermediates that accumulate in time-resolved studies
of molecule ensembles do not necessarily reveal all functional states and it
seems reasonable that a gate should form only when sodium is in close
proximity to the SB. This would allow fast transfer of sodium to the
extracellular side upon deprotonation. The gate would then collapse
immediately upon sodium binding on the extracellular side, coinciding with
reprotonation of the SB. Such a mechanism would efficiently prevent sodium
back leakage. Differences in the position of N112 between the monomeric and
pentameric resting states could furthermore explain the long-range effects of
mutations in the oligomerization interface on the photocycle and sodium
pumping efficiency.
Extended Data Fig. 6 | Comparison of ion binding sites in selected members
of the rhodopsin family and QM/MM simulations. a, Sodium binding site in
Krokinobacter eikastus rhodopsin 2 (KR2, this work). b, Chloride binding site in
the related bacterial pump Nonlabens marinus rhodopsin 3 (NM-R3, PDB
5G28). c, Protonated SB in bacteriorhodopsin (bR, PDB 6RQP). d, Chloride
binding site in halorhodopsin (HR, PDB 1E12). The retinal chromophore
together with selected interactions and amino acid side chains are shown as
dotted lines or sticks. Besides overall similarity in ion binding between bacteria
and archaea, the suggested binding site in KR2 is in line with quantum chemical
calculation of absorbance changes upon sodium binding. e-g, The QM/
MM-optimized geometry of the resting state (e) and the 1-ms structure with
water (f) or sodium (g) bound to N112, D251. h, Comparison of the absorption
shifts of the states measured in UV/Vis difference spectroscopy with the
calculated QM/MM excitation wavelengths. Absolute values are given in
parentheses. i, j, The extent and location of structural changes in KR2 at 20 ms
(i) and bR at 15 ms (j) after activation (blue-to-red gradient and ribbon width
indicate r.m.s.d. to the resting state, bR coordinates taken from previous
work). The light-driven pumping against a concentration gradient is achieved
with smaller conformational changes in the order of 1–2 Å in sodium pumping
KR2 compared to the more elusive protons pumped by bR. The
translocation path for sodium and protons (arrows), however, is similar for both
proteins and includes three critical sites (see reviews on bR structural
dynamics and activation mechanism). Close to the water-filled cavity on the
cytoplasmic side of KR2, Q123 of the NDQ motif is the most likely location
where the sodium ion loses its water coordination shell. The position is
analogous to D96 of the DTD motif in bR, which is the primary donor for
reprotonation at the end of the pumping cycle. The second critical site is
formed by D116, N112 and D251, which correspond to D85, T89 and D212 in bR
and coordinate sodium and proton transport in the SB region. The role of R109
in switching from water to sodium binding in KR2 is analogous in position and
function to that of R82, which regulates the transfer of protons towards the
release group in bR. The proton release group in bR is formed by E194 and E204,
with the position overlapping well with the second sodium binding site
between E11, N106 and E160 at the extracellular side of KR2. Besides these
similarities, the sequence of events is clearly different between the two
outward ion pumps. Light-driven pumping in bR starts with (1) a prebound
proton followed by (2) release and (3) reloading. In KR2 the process is shifted in
sequence with (1) entry, (2) binding and (3) release with a corresponding
adaptation of the photocycle intermediates. The similar ion binding modes
between bacteria and archaea, together with how retinal isomerization is used
to drive them, provide a notable example of the evolutionary economy of
nature. The adaptation of this common leitmotif is particularly interesting in
the case of KR2, where substrate binding has been shifted from the stable
resting state into a transient intermediate.


Extended Data Table 1| X-ray statistics 


Dataset | Dark-acidic | Dark-neutral | 800 fs | 2 ps | 100 ps | 1 ns | 16 ns | 1 µs | 30 µs | 150 µs | 1 ms | 20 ms | 800 fs + 2 ps | 1 ns + 16 ns | 30 µs + 150 µs

Data collection

Space group: I222

Cell dimensions

a, b, c (Å): 41.5, 84.5, 235.6
α, β, γ (°): 90, 90, 90
Indexed patterns | 47'518 | 158'832 | 47'801 | 47'595 | 43'162 | 45'045 | 50'165 | 55'041 | 50'754 | 48'383 | 60'629 | 48'329 | 95'396 | 95'210 | 99'137
Indexing rate (%) | 12.51 | 11.30 | 10.68 | 15.28 | 9.65 | 14.11 | 14.28 | 11.87 | 9.00 | 9.61 | 12.23 | 9.16 | - | - | -
Overall statistics excluding anisotropic shell 12.4 A A (Overall statistics 12.4 A — 1.60 A including anisotropic shell 

Resolution along a*; 2.0; 2.2; 2.1; 2. 21; 2. 2.1; 2.1; 2.1; 2. 21; 2.0; 2.0; 

b*; c* (A) 2.3; 1.6 2.3; 1.6 2.3;1.7 2.20; 1.6 2.20; 1.6 2.3,1.7 2.3; 1.7 2.3;1.7 2.3; 1.7 2.2;1.7 2.20;1.6 2.2; 1.6 2.2;16 

No. reflections | 18'807 (28'904) | 18'873 (31'741) | 18'840 (26'799) | 18'840 (27'107) | 18'872 (29'755) | 18'847 (28'057) | 18'827 (26'141) | 18'828 (26'237) | 18'793 (25'531) | 18'812 (26'123) | 18'847 (27'051) | 18'863 (29'034) | 18'874 (29'870) | 18'872 (30'047) | 18'861 (28'402)
Completeness (%) | 99.7 (52.3) | 100 (57.5) | 99.8 (48.5) | 99.8 (49.1) | 100 (53.9) | 99.9 (50.8) | 99.8 (47.3) | 99.8 (47.5) | 99.6 (46.2) | 99.7 (47.3) | 99.9 (49.0) | 99.9 (52.6) | 100 (54.1) | 100 (54.4) | 99.9 (51.4)
Multiplicity | 608 (505) | 1640 (1328) | 522 (444) | 497 (419) | 475 (395) | 507 (431) | 547 (469) | 563 (481) | 505 (439) | 479 (413) | 594 (505) | 531 (445) | 1018 (834) | 1053 (869) | 983 (827)
Rsplit (%) | 6.1 (6.8) | 4.3 (4.9) | 8.0 (8.6) | 8.3 (9.0) | 8.4 (9.1) | 7.8 (8.4) | 7.3 (7.8) | 7.4 (8.0) | 7.1 (7.7) | 7.8 (8.4) | 6.9 (7.5) | 7.6 (8.2) | 5.7 (6.3) | 5.3 (6.9) | 5.3 (5.9)
CC1/2 | 1.00 (1.00) | 1.00 (1.00) | 0.99 (0.99) | 0.99 (0.99) | 0.99 (0.99) | 0.99 (0.99) | 0.99 (0.99) | 0.99 (0.99) | 0.99 (0.99) | 0.99 (0.99) | 0.99 (0.99) | 0.99 (0.99) | 1.00 (1.00) | 1.00 (1.00) | 1.00 (1.00)
<I/σ(I)> | 11.4 (8.3) | 16.5 (11.4) | 9.2 (7.1) | 9.1 (7.0) | 9.2 (6.7) | 9.2 (7.0) | 9.6 (7.5) | 9.7 (7.6) | 9.4 (7.5) | 9.3 (7.3) | 10.4 (7.9) | 9.8 (7.2) | 12.8 (9.1) | 13.1 (9.2) | 13.1 (9.6)

High resolution statistics of the anisotropic shell (2.3 Å – 1.60 Å)

No. reflections | 10'097 | 12'867 | 7'959 | 8'267 | 10'883 | 9'210 | 7'314 | 7'409 | 6'738 | 7'311 | 8'204 | 10'171 | 10'996 | 11'175 | 9'541
Completeness (%) | 27.7 | 35.4 | 21.9 | 22.7 | 29.9 | 25.3 | 20.1 | 20.4 | 18.5 | 20.1 | 22.6 | 28.0 | 30.2 | 30.7 | 26.2
Multiplicity | 331 | 872 | 261 | 241 | 256 | 274 | 269 | 272 | 256 | 244 | 302 | 285 | 519 | 559 | 518
Rsplit (%) | 38.3 | 30.0 | 43.8 | 44.3 | 38.2 | 41.0 | 45.2 | 44.0 | 45.8 | 43.8 | 42.1 | 38.2 | 37.0 | 35.4 | 37.1
CC1/2 | 0.90 | 0.95 | 0.86 | 0.84 | 0.89 | 0.89 | 0.84 | 0.86 | 0.85 | 0.86 | 0.87 | 0.89 | 0.91 | 0.92 | 0.91
<I/σ(I)> | 2.5 | 3.1 | 2.2 | 2.2 | 2.5 | 2.4 | 2.2 | 2.2 | 2.2 | 2.2 | 2.3 | 2.5 | 2.6 | 2.7 | 2.6
Extended Data Table 2 | Refinement statistics

Dataset | Dark-acidic | Dark-neutral | 800 fs + 2 ps | 1 ns + 16 ns | 30 µs + 150 µs | 1 ms | 20 ms

Data collection
Space group: I222
Cell dimensions
a, b, c (Å): 41.5, 84.5, 235.6
α, β, γ (°): 90, 90, 90
Resolution (Å) | 12.36 – 1.6* (2.3 – 1.6) | 12.36 – 1.6* (2.3 – 1.6) | 12.36 – 2.25 (2.33 – 2.25) | 12.35 – 2.25 (2.33 – 2.25) | 12.35 – 2.25 (2.33 – 2.25) | 12.36 – 2.5 (2.58 – 2.5) | 12.36 – 2.5 (2.58 – 2.5)
Rsplit | 6.8 (38.3) | 4.9 (30.0) | 5.8 (24.29) | 5.4 (24.1) | 5.4 (27.3) | 6.6 (17.3) | 7.4 (17.1)
<I/σ(I)> | 8.3 (2.5) | 11.4 (3.1) | 12.2 (3.5) | 12.5 (3.7) | 12.5 (3.3) | 12.2 (5.1) | 11.3 (5.6)
CC1/2 | 1.00 (1.00) | 1.00 (1.00) | 0.99 (0.95) | 0.99 (0.96) | 0.99 (0.96) | 0.99 (0.97) | 0.99 (0.96)
Completeness (%) | 52.3 (27.7)* | 57.5 (35.4)* | 99.9 (98.6) | 99.9 (98.6) | 99.7 (97.5) | 99.9 (100) | 99.9 (100)
Multiplicity | 505 (331) | 1328 (872) | 994 (643) | 1028 (660) | 960 (616) | 647 (443) | 579 (400)

Refinement
Resolution (Å) | 12.36 – 1.6 (1.64 – 1.6) | 12.43 – 1.6 (1.64 – 1.6) | 12.36 – 2.25 (2.35 – 2.25) | 12.35 – 2.25 (2.35 – 2.25) | 12.35 – 2.25 (2.35 – 2.25) | 12.36 – 2.5 (2.66 – 2.5) | 12.36 – 2.5 (2.66 – 2.5)
No. reflections | 28'895 (123) | 31'728 (203) | 18'021 (1874) | 18'073 (1906) | 17'917 (1920) | 13'446 (2007) | 13'268 (1871)
Rwork / Rfree | 17.67 / 21.3 | 18.8 / 21.7 | 27.2 / 32.8 | 26.5 / 33.2 | 26.7 / 33.5 | 25.8 / 32.4 | 27.8 / 35.3
No. atoms
Protein | 2139 | 2105 | 2103 | 2111 | 2099 | 2099 | 2099
Ligands | 242 | 242 | 242 | 242 | 242 | 243 | 243
Water | 99 | 60 | 53 | 50 | 49 | 49 | 47
B-factors
Protein | 31.8 | 36.1 | 42.1 | 40.1 | 33.7 | 41.7 | 52.1
Ligands | 61.5 | 62.1 | 65.6 | 65.5 | 53.2 | 63.6 | 69.9
Water | 46.9 | 46.1 | 41.6 | 43.6 | 36.3 | 43.6 | 50.1
R.m.s. deviations
Bond lengths (Å) | 0.011 | 0.006 | 0.008 | 0.007 | 0.007 | 0.007 | 0.008
Bond angles (°) | 1.063 | 0.735 | 0.938 | 0.928 | 0.928 | 0.865 | 0.921
PDB code | 6TK7 | 6TK6 | 6TK5 | 6TK4 | 6TK3 | 6TK2 | 6TK1

* Data are anisotropic; see Extended Data Table 1 and Methods.
Corrections & amendments


Author Correction: New 
infant cranium from the
African Miocene sheds 
light on ape evolution 


https://doi.org/10.1038/s41586-020-2466-7 


Correction to: Nature https://doi.org/10.1038/nature23456 


Published online 10 August 2017 




Isaiah Nengo, Paul Tafforeau, Christopher C. Gilbert, John G. Fleagle,
Ellen R. Miller, Craig Feibel, David L. Fox, Josh Feinberg,
Kelsey D. Pugh, Camille Berruyer, Sara Mana, Zachary Engle &
Fred Spoor


In Extended Data Fig. 3f of this Article, the specimen originally labelled
‘P. pygmaeus’ is in fact a specimen of Pan troglodytes, and has now been
replaced with a confirmed specimen of the Bornean orangutan (Pongo
pygmaeus). Figure 1 of this Amendment shows the corrected Extended
Data Fig. 3. To correct the sample numbers for each species and to add
the new Pongo specimen now used in Extended Data Fig. 3, the Methods
paragraph “The general dental development pattern of KNM-NP
59050... found only in Hylobates and Hoolock.” should read as follows:
“The general dental development pattern of KNM-NP 59050, and the
advanced I¹ development in particular, were studied in more detail by
making comparisons with extant juvenile hominoids and cercopithecoids.
These included Pan troglodytes (13), Gorilla gorilla (3), Pongo
pygmaeus (2), Homo sapiens (6), Hoolock sp. (4), Hylobates muelleri
(1), Nomascus hainanus (1); and the cercopithecoids Papio ursinus (1),
Cercopithecus petaurista (1), Macaca sp. (2), and Macaca nigra (1). These
specimens are in the collections of the Musée des Confluences de Lyon
and were scanned at the European Synchrotron Radiation Facility,
except for one Pongo specimen from the Museum of Comparative Zoology,
Harvard University that was scanned at the Center for Nanoscale
Systems (CNS) at Harvard University, the Hoolock material, which is
housed in, and was scanned at, the American Museum of Natural History
(New York), as well as Hylobates and Nomascus specimens, which are
housed at the Museum für Naturkunde, Berlin, and were scanned at the
Max Planck Institute of Evolutionary Anthropology (Leipzig). Results of
these comparisons show that the unusual pattern of advanced development
of the I¹s is found only in Hylobates and Hoolock.” These errors do
not affect any of the observations and conclusions. The original Article
has not been corrected.


Fig. 1 | This figure shows the corrected Extended Data Fig. 3f. Pongo pygmaeus. Scale bar, 2 cm.
Publisher Correction:
IGF1R is an entry receptor
for respiratory syncytial 
virus 


https://doi.org/10.1038/s41586-020-2437-z 


Correction to: Nature https://doi.org/10.1038/s41586-020-2369-7 


Published online 03 June 2020 




Cameron D. Griffiths, Leanne M. Bilawchuk, John E. McDonough,
Kyla C. Jamieson, Farah Elawar, Yuchen Cen, Wenming Duan,
Cindy Lin, Haeun Song, Jean-Laurent Casanova, Steven Ogg,
Lionel Dylan Jensen, Bernard Thienpont, Anil Kumar,
Tom C. Hobman, David Proud, Theo J. Moraes & David J. Marchant


In Fig. 2d of this Article, owing to an error during the production 
process, the final rightmost x-axis label should be ‘Ciliated bronchial 
epithelial cells’ and not ‘Lung fibroblasts’. This error has been corrected 
online. 
Publisher Correction: 
Resolving acceleration to 
very high energies along 
the jet of Centaurus A

Correction to: Nature https://doi.org/10.1038/s41586-020-2354-1


Published online 17 June 2020 




The H.E.S.S. Collaboration 


In the print version of this Article, M. Holler, M. de Naurois, F. Rieger,
D. A. Sanchez and A. M. Taylor should have been indicated as corresponding
authors in the author list, all with email address
contact.hess@hess-experiment.eu. Additionally, the line ‘Correspondence and
requests for materials should be addressed to M.H., M.d.N., F.R., D.A.S.


ARISING FROM: M. Momcilovic et al. Nature https://doi.org/10.1038/s41586-019-1715-0 (2019) 


The recent paper by Momcilovic et al.¹ presents important results on 
mitochondrial metabolism differences among various lung cancer subtypes 
in mouse. However, the work¹ propagates critical misunderstandings 
and omissions about the underlying basis for the application of 
voltage-sensing tracers. The principle of imaging of mitochondrial membrane 
potential (ΔΨ_m) with radiolabelled positron emission tomography 
(PET) probes is based on benchtop in vitro methods. In vitro quantification 
of ΔΨ_m has been established for decades using various lipophilic 
cations such as ³H-tetraphenylphosphonium (³H-TPP⁺)²⁻⁴. In all cases, 
the Nernst equation is used to relate the equilibrium concentration of 
the probe on either side of a membrane to the electric potential across 
the membrane⁵. Importantly, the Nernst equation is only valid when 
probe concentrations are at equilibrium, that is, when the concentrations 
on either side of the membrane are not changing with time. 
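
For reference, the equilibrium relation invoked here can be written out explicitly. This is the standard textbook form of the Nernst equation for a cation of charge z, not a formula quoted from the Comment; the sign convention (potential measured inside relative to outside) and the worked value of −150 mV are assumptions chosen only for illustration.

```latex
\Delta\Psi = -\frac{RT}{zF}\,\ln\frac{C_{\mathrm{in}}}{C_{\mathrm{out}}}
\quad\Longleftrightarrow\quad
\frac{C_{\mathrm{in}}}{C_{\mathrm{out}}} = \exp\!\left(-\frac{zF\,\Delta\Psi}{RT}\right)

% At 37 °C (T = 310.15 K) and z = +1:  RT/F \approx 26.7\ \mathrm{mV}.
% Illustrative case, \Delta\Psi = -150\ \mathrm{mV}:
\frac{C_{\mathrm{in}}}{C_{\mathrm{out}}} = \exp\!\left(\frac{150}{26.7}\right) \approx 2.7 \times 10^{2}
```

The relation holds only once the two concentrations have stopped changing, which is the equilibrium condition stressed in the paragraph above.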

Momcilovic et al. claim to “measure mitochondrial membrane 
potential in non-small-cell lung cancer in vivo using a voltage-sensitive 
PET radiotracer”, the lipophilic cation 
4-[¹⁸F]fluorobenzyl-triphenylphosphonium (¹⁸F-BnTP). It must be pointed out that, contrary 
to the authors’ claim, they do not measure membrane potential. Their 
work is based on the empirical endpoint, per cent dose per gram-tissue 
(normalized by the %dose of the myocardium). The %dose of ¹⁸F-BnTP is 
time-dependent and a function of several physiological variables, 
including the level of ¹⁸F-BnTP in plasma, the fractional volumes of 
extracellular space (ECS), the cellular membrane potential, and also 
ΔΨ_m. The need to normalize the endpoint to that of some other organ 
that may evolve differently in time is also a problem for reproducibility 
and clinical translation. We have previously studied these crucial 
aspects of how the research and clinical translation of voltage-sensing 
compounds can take place⁶. We have shown that a unique absolute 
endpoint, in units of millivolts (mV), can be obtained during secular 
equilibrium of a radiotracer such as ¹⁸F-TPP⁺ and ¹⁸F-BnTP. Notably, 
we have shown that, at steady state, the tracer concentration in tissue 
depends nearly linearly on the tracer concentration in plasma, on 
the fractional volumes of mitochondria and on the ECS, but depends 
exponentially on the sum of the cellular and mitochondrial membrane 
potential: 


$$C_T \approx C_p\,(1 - f_{ECS})\,f_{mito}\,e^{-(\Delta\Psi_c + \Delta\Psi_m)/\beta} \qquad (1)$$


where C_T and C_p (in units of Bq ml⁻¹) are, respectively, the steady-state 
tracer concentration in tissue and plasma, ΔΨ_c and ΔΨ_m [mV] represent, 
respectively, the cellular and mitochondrial membrane potentials, f_ECS 
and f_mito (unitless) represent, respectively, the volume fractions of ECS 
and mitochondria, and β [mV] is a constant term representing the 
ratio of known physical constants. 
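
The steady-state relation described above can be sketched numerically. This is a minimal illustration, not the authors' analysis code: it assumes equation (1) has the form C_T ≈ C_p (1 − f_ECS) f_mito exp(−(ΔΨ_c + ΔΨ_m)/β), takes β ≈ 26.7 mV (RT/F at 37 °C for a monovalent cation), and uses hypothetical function names and input values chosen only for illustration.

```python
import math

# Assumption: beta = RT/F at 37 degC for a monovalent cation, in mV.
BETA_MV = 26.7


def tissue_to_plasma_ratio(dpsi_c_mv, dpsi_m_mv, f_ecs, f_mito, beta_mv=BETA_MV):
    """Steady-state C_T / C_p under the assumed form of equation (1)."""
    return (1.0 - f_ecs) * f_mito * math.exp(-(dpsi_c_mv + dpsi_m_mv) / beta_mv)


def total_potential_mv(ratio, f_ecs, f_mito, beta_mv=BETA_MV):
    """Invert the same relation to recover dpsi_c + dpsi_m in mV."""
    return -beta_mv * math.log(ratio / ((1.0 - f_ecs) * f_mito))


# Hypothetical inputs: -60 mV cellular, -140 mV mitochondrial potential,
# 20% extracellular space, 10% mitochondrial volume fraction.
ratio = tissue_to_plasma_ratio(-60.0, -140.0, f_ecs=0.2, f_mito=0.1)
recovered = total_potential_mv(ratio, f_ecs=0.2, f_mito=0.1)

# A 10 mV mitochondrial depolarisation scales the ratio by exp(-10 / beta),
# roughly 0.69, whereas halving f_mito would simply halve it.
depolarised = tissue_to_plasma_ratio(-60.0, -130.0, f_ecs=0.2, f_mito=0.1)
print(ratio, recovered, depolarised / ratio)
```

The inversion step makes the Comment's point concrete: recovering a potential in mV requires the plasma concentration and both volume fractions in addition to the tissue signal.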


It is also interesting to note that at steady state, equation (1) can be 
used to express the result in terms of percent dose: 


$$\%D_T \approx \%D_p\,(1 - f_{ECS})\,f_{mito}\,e^{-(\Delta\Psi_c + \Delta\Psi_m)/\beta} \qquad (2)$$


where we have expressed the result as a ratio of %dose fractions (%D_T and %D_p). 
We have previously found that the plasma tracer concentration C_p 
decreases monotonically after bolus injection of ¹⁸F-TPP⁺, meaning 
that the %dose index will not be time-invariant, which may limit the 
reliability of this endpoint and its usefulness in research and clinical 
translation. 

Although Momcilovic et al.¹ did show similar mitochondrial density 
(f_mito) for both cancer subtypes with the pan-mitochondrial marker 
TOM12, they did not account for other critical variables such as the 
volume fraction of ECS and the plasma tracer concentrations. Finally, 
adequate quantification of the membrane potential requires measurements 
of equilibrium tissue and plasma concentrations⁶, which cannot 
be done with a bolus radiotracer injection and delayed imaging protocol 
as performed in the study. 

Nevertheless, the results of Momcilovic et al.¹ add to the body of 
evidence supporting the potential role of in vivo assessment of mitochondrial 
status in oncology. The authors rightly pointed out that 
imaging membrane potential might represent a valuable resource for 
the evaluation of mitochondrial activity in several areas of research 
including ageing, physiology, and diseases. However, for successful 
translation of the methodology to human research and ultimately 
to the clinic, accurate and reproducible quantification is necessary. 
This can only be achieved with proper techniques and accounting for 
all critical variables. 


1. Momcilovic, M. et al. In vivo imaging of mitochondrial membrane potential in 
non-small-cell lung cancer. Nature 575, 380-384 (2019). 

2. Kauppinen, R. Proton electrochemical potential of the inner mitochondrial membrane in 
isolated perfused rat hearts, as measured by exogenous probes. Biochim. Biophys. Acta 
725, 131-137 (1983). 

3. Rottenberg, H. Membrane potential and surface potential in mitochondria: uptake and 
binding of lipophilic cations. J. Membr. Biol. 81, 127-138 (1984). 

4. Wan, B. et al. Effects of cardiac work on electrical potential gradient across mitochondrial 
membrane in perfused rat hearts. Am. J. Physiol. 265, H453-H460 (1993). 

5. Nernst, W. Die elektromotorische Wirksamkeit der Jonen. Z. Phys. Chem. 4, 129-181 

6. Alpert, N. M. et al. Quantitative in vivo mapping of myocardial mitochondrial membrane 
potential. PLoS One 13, e0190968 (2018); correction 13, e0192876 (2018). 


REPLYING TO: N. M. Alpert et al. Nature https://doi.org/10.1038/s41586-020-2366-x (2020) 


We thank the authors of the accompanying Comment¹ for their interest 
in our work and for acknowledging the biomedical significance of our 
recent multi-disciplinary study using positron emission tomography 
(PET) imaging to assess mitochondrial membrane potential in vivo 
within lung tumours². To accomplish this work we used the ¹⁸F-labelled 
lipophilic cation tracer ¹⁸F-BnTP as a tool to image lung tumours in 
mouse models and identified conserved differences in mitochondrial 
function within lung tumour subtypes. Our work, along with studies by 
others in the field³,⁴, builds upon prior work by Alpert and colleagues⁵ 
and demonstrates that ¹⁸F-BnTP accumulation within tissues is responsive 
to the voltage across the mitochondrial inner membrane, enabling 
measurement of relative changes in mitochondrial membrane potential 
in tissues such as the heart. However, although we acknowledge the 
work performed by Alpert et al. on measuring mitochondrial membrane 
potential, we disagree with their claims that our study “represents 
critical misunderstandings and omissions”¹ regarding the underlying 
basis of the application of voltage-sensing tracers. 

It has been well established that the uptake of the ¹⁸F-BnTP tracer 
into tissues increases with the voltage across the mitochondrial inner 
membrane and the plasma membrane. However, relating this uptake 
to voltage is challenging; we therefore imaged mice at 60 minutes after 
¹⁸F-BnTP injection to ensure that each animal was as close to steady state 
as possible. We chose the heart as a tissue to normalize uptake of the 
probe because this tissue reaches a steady-state level of accumulation 
in around 10 min and shows little change in probe retention over the 
following 50 min. This enabled us to clearly show changes in probe 
uptake within the tumours. Although taking multiple plasma samples 
and kinetic imaging analysis would be interesting, it would add limited 
value and make the scope of studies performed impractical. Further 
studies that use computed tomography (CT) imaging to measure the fractional 
volumes of extracellular space, together with plasma tracer concentrations, 
to better quantify membrane potential are appealing but 
not practical, and would not translate to clinical situations. Although 
these extra measurements would partially facilitate absolute quantification, 
this is not necessary to show the clinically significant relative 
changes in membrane potential between tumour and control tissue. 

The remaining debate lies in our description of the quantitative measurement 
of mitochondrial membrane potential. Alpert et al.¹ are right 
to state that we did not directly measure mitochondrial membrane 
potential in millivolts (mV) and that it was imprecise to use this term in 
our original paper. Instead, we measured the relative uptake of ¹⁸F-BnTP 
in lung tumours normalized to uptake of the tracer in the heart. The 
changes in uptake of ¹⁸F-BnTP that were measured are reflective of 
changes in mitochondrial and plasma membrane potentials following 
the bolus injection of the PET ligand at a set time. Importantly, by 
carefully measuring the relative uptake of ¹⁸F-BnTP in the tumour and 
the heart tissue at the same dose and time after injection of PET ligand, 
and using built-in controls that include treating mice with bona fide 
respiratory-chain inhibitors, we can infer relative differences within 
each tumour that are consistent with changes in mitochondrial polarization. 
Our methodology is analogous to in vitro cell culture assays that 
measure relative changes in mitochondrial membrane potential using 
dyes, such as TMRE and Rhodamine 123, that accumulate within the 
mitochondria in a voltage-dependent manner. In these assays, absolute 
values of the mitochondrial membrane potential in mV are difficult 
to infer from the fluorescent intensity measurements, because the 
application of the Nernst equation requires the measurement of many 
other cell parameters, which are challenging to determine. Instead, the 
relative differences in membrane potential are presented in arbitrary 
units (a.u.). The change in units (whether in mV or a.u.) is reflective of 
differences in the mitochondrial membrane potential. Applying this 
robust approach to our in vivo work, we have not presented absolute 
measurements in mV of mitochondrial membrane potential in lung 
tumours, but instead we have measured the ratio of tracer uptake in 
lung tumours to that in heart tissue. In essence, the matter that has 
arisen can be reduced to a difference in methodology: we measured 
and quantified a ratio instead of an absolute value in mV. Using this 
methodology we are able to accurately deduce changes in ¹⁸F-BnTP 
uptake in lung tumours that reflect perturbations in the mitochondrial 
membrane potential. 

We have demonstrated that our approach is reproducible in mice 
and shown that ¹⁸F-BnTP is a valuable tool for the in vivo study of mitochondrial 
biology in mouse models of lung cancer. We recognize that 
clinical translation of ¹⁸F-BnTP into humans will require additional 
studies. However, this does not detract from the impact or accuracy 
of pre-clinical studies, or the importance of this probe in future clinical 
applications. We agree that clinical translation of probes such as 
¹⁸F-BnTP will require consideration of other variables, rigorous feasibility 
studies and cooperation between multiple disciplines. Finally, our 
work² as well as that of Alpert and colleagues⁵ is critical for advancing 
technologies to characterize mitochondrial function in vivo, and as 
such, we are thankful to the authors of the accompanying Comment¹ 
for addressing this matter in open and respectful debate. 

Campaigners march through Amsterdam on 1 June to protest against anti-Black violence in Europe and the United States. 


FIGHTING RACISM DEMANDS 
MORE THAN JUST WORDS 


Frustrated and exhausted by systemic bias in the science community, 
Black researchers call on their colleagues and institutions to take action. 


Black academics are calling out racism 
in science, recounting behaviours 
ranging from overt acts to microaggressions, 
using social-media 
hashtags such as #BlackInTheIvory. 
A study in April (B. Hofstra et al. Proc. Natl 
Acad. Sci. USA 117, 9284-9291; 2020) highlighted 
how students from under-represented groups 
innovate more than their white male counterparts 
do, but receive few to no career benefits 
from their discoveries, because their contributions 
are often overlooked. Nature spoke 
to six Black academic researchers about the 
effects of racism on their careers, their advice 
to their white colleagues and their thoughts 
on meaningful institutional actions. 


VASHAN WRIGHT 
WHITE COLLEAGUES HAVE THE 
POWER TO CHANGE THE SYSTEM 


Before I started out as an undergraduate, 
I thought a university campus was a magical 
place. I thought I'd be treated not on the 
basis of my skin colour, but according to how 
smart I was and my efforts to make the world 
a better place. This view changed when I saw 
a Confederate flag flying on my campus, the 
first time I’d seen one in real life. It changed 
when someone used a racial epithet against 
me and threatened to kill me. 

Black academic success happens despite 
systemic racism and bias. It’s hard to tell whether 
dismissive body language is bias. It’s hard to 
tell why my first paper was in limbo for ten 
months and rejected without review. It’s hard 
to tell whether people’s faces are expressing 
surprise at my ideas, or at the fact that those 
ideas are coming from someone who is Black. 
It’s those quiet things that eat away at you. 

In March, I started as a postdoc at Woods 
Hole Oceanographic Institution (WHOI) in 
Massachusetts, which I had visited a couple 
of times as a PhD student. But I haven't physically 
been there in my new position because 
of the coronavirus. I enjoyed earning a PhD in 
a laboratory with a diverse range of students 
at Southern Methodist University in Dallas, 
Texas, and I'll miss that environment, but 
I appreciate that WHOI has formed a committee 
for diversity and inclusion and a 
partnership-education programme to foster 
a culture of inclusivity. 

More recently, Hendratta Ali, a geoscientist 
at Fort Hays State University in Kansas, and 
colleagues crafted an anti-racist action plan 
for geoscience societies. It calls for them to 
collect scientifically valid data on diversity, 
equity and inclusion, to publish accountability 
reports and to ensure that diversity and racial 
justice are discussed at well-attended events. 

Still, the geosciences are among the least 
diverse disciplines in science, technology, 
engineering and mathematics. I sometimes 
worry that I'll never be able to recreate the 
diverse environment that I enjoyed during 
my PhD, but I am committed to that vision as 
I advance in my career. 

Some white faculty members don’t want 
to acknowledge that Black students experi- 
ence racism. They don’t necessarily deny your 
experiences, but they often look for another 
explanation to try to protect you. A faculty 
member of colour knows that they cannot 
protect you from it. 

Scientific communities need to decide 
where, and for what, they stand by asking 
their members: what message are you sending 
if you are not actively being anti-racist and 
trying to change the system? I say to my white 
colleagues: you have the most power in the 
geosciences; you benefit the most from racism 
and lack of diversity. It is therefore your job to 
fix the system. But I'll help you. 


Vashan Wright is a geophysicist at 
Woods Hole Oceanographic Institution, 
Massachusetts. 


MARK RICHARDS 
COMMIT TO BOLD 
HIRING TARGETS 


It’s quite easy for a Black person in the United 
Kingdom to relate to underlying issues of 
racism in the United States. Heavy-handed 
policing is not new to our Black community. 
Like most people around the world, we are 
experiencing sadness, anger and frustration. 
Being one of the few Black academics in my 
field and at this university, I feel a level of 
responsibility to do what I can to nurture and 
inspire the next generation. To be a good ally, 
it’s not quite enough to be neutral. You have 
to be anti-racist when necessary. Racism is 
like a virus — attitudes spread when they are 
validated. To kill the racism virus, you have to 
distance yourself from that ideology so that 
eventually there is nowhere for it to be trans- 
mitted. The best way to combat white suprem- 
acy is to focus on Black excellence. 

Until recently, many institutions didn’t feel 
there was a problem. They lamented that no 
Black students were applying, but a ‘What can 
we do?’ attitude prevailed for many years. I 
think that institutions must look at their data 
to shed light on whether they are being institutionally 
racist. If a university is located in the 
centre of a huge city with a high Black population 
living locally, but has only a minuscule 
number of Black students, that has to tell you 
something. 

Physicist Mark Richards wants institutions to take on more students from minority groups. 

At Imperial College London, we have 
been trying for 15 years to address a short- 
age of Black students and those from other 
under-represented groups, through an advisory 
group called Imperial as One. Yet representation 
of people from all minority ethnic 
groups remains low here, at less than 10%. The 
problem will not be solved simply by employing 
more faculty members of colour. We have 
to actually attract students from under-represented 
groups. Imperial did something 
decisive and made a commitment to double 
its intake of students from under-represented 
groups over the next five years. The push to 
achieve this goal is due to start this autumn. I 
hope other academic institutions will replicate 
these types of action. 

Beyond simply treating all people as equals, 
enough data exists to show that diverse teams 
are more productive. Posing the question, ‘Is 
your university being as diverse as it could 
be?’ is essentially the same as asking, ‘Is the 
university being as productive as it could be?’ 
What leader doesn’t want to ask that question? 


Mark Richards is a physicist at Imperial 
College London. 


KISHANA TAYLOR 
CONSIDER ‘CLUSTER HIRING’ 


I met my postdoc adviser, virologist Sam 
Díaz-Muñoz, in 2017 at a Twitter meet-up, at a 
conference organized by the American Society 
for Microbiology in New Orleans, Louisiana. 
At the end of the conversation, he asked me to 
apply for a job in his new lab at the University 
of California, Davis. 

My PhD experience, at the University of 
Georgia in Athens, had dealt a blow to my 
self-esteem. I was the only Black graduate 
student in a department with no Black 
professors. I was by myself, in terms of rep- 
resentation, in a state where Black people 
make up 30% of the population. When I met 
my postdoc adviser, I hadn’t yet published anything 
and had no confidence that I would be 
competitive. Had he not contacted me a second 
time, I would not have pursued that position. 

During my PhD programme, between 2013 
and 2018, high-profile shootings of unarmed 
Black people — notably Eric Garner, Tamir Rice, 
who was just 12 years old, and Michael Brown 
— prompted a number of criminal trials, but 
none secured a conviction. I'd come into the 
lab feeling heavy and upset after these deaths. 
No one noticed. I tried to talk about my feelings 
to a staff member, who told me not to get 
too upset, because we didn’t know what had 
happened. I learnt quickly to not even bother 
trying to have those conversations. 

MARK RICHARDS/IMPERIAL COLLEGE 

DOANISE THOMPSON 

Virologist Kishana Taylor calls for more scientists of colour in leadership positions. 

I was going to leave science after my PhD 
unless I found a lab that valued diversity. 
Although I found a principal investigator who 
does, my department at Davis doesn’t have any 
Black professors. Academic institutions need 
to take action instead of just saying they value 
students from under-represented minority 
groups. We need more scientists of colour in 
leadership positions. 

Practices such as ‘cluster hiring’, which can 
be used to diversify faculties, have been tried 
at a number of universities and should get 
more attention. With cluster hiring, universities 
advertise multiple faculty positions at 
once, but don’t always stipulate specific fields; 
this can improve the odds that candidates 
from under-represented minority groups will 
be selected, making it easier for them to fit in if 
they are. It’s infinitely more lonely, and harder 
to adjust, if someone is the only Black person 
to be hired in a department. For example, the 
entire University of Maryland system has a US 
National Science Foundation grant to create a 
pre-professoriate pipeline, hiring transitional 
postdocs with the intent of bringing them in 
as faculty members in a year or two. 

It’s important for white colleagues to ask 
themselves whether they’d be comfortable 
walking into a room full of Black or Latin 
American people — and then to imagine what 
it’s like for us when, every day of our lives, we 
have to enter rooms full of only white people. 
My biggest pet peeve is when white colleagues, 
who do research for a living, ask me for advice 
on how to be an ally without having done any 
research. It’s not hard to find journal articles 
that detail the impact of diversity, equity and 
inclusivity initiatives. 


Kishana Taylor is a virologist at the University 
of California, Davis. 


NIKEA PITTMAN 
CREATE OPPORTUNITIES FOR 
DIFFICULT CONVERSATIONS 


My university and its diversity and inclusivity 
committee have released messages of support 
for the Black community following the protests 
in the wake of George Floyd’s death, but it can 
feel like there’s a lot of silence beyond those 
messages. It’s hard for a Black person to initiate 
conversations around this topic because of a 
fear of coming across as aggressive, especially 
when those conversations might not be wel- 
come. In early June, two white male colleagues 
(not in my lab) discussed the protests, stand- 
ing less than a metre away from me, while I 
pretended not to be hurt by the exclusion. 

Almost two weeks later, one of the youngest 
graduate students in my lab suggested we have 
a group conversation about race. I intentionally 
didn’t open my mouth until everyone else 
had spoken. I hadn’t heard what my colleagues 
think about these issues, and it was easier for 
me to open up once it became apparent that 
everyone wanted to know how to be an ally to 
the Black community. If your lab hasn’t had this 
conversation yet, you can still have it — and it 
can make a big difference. 

I’ve been waiting for more than a decade to 
hear non-Black people be outraged at the way 
Black people are treated in the United States, 
and was hopeful that we could collectively 
mourn. Even though it’s not my responsibility, 
I want to help make these conversations 
easier. But I don’t always know how to proceed. 
I jumped on a bandwagon of Black academics 
on social media who are volunteering to 
support younger colleagues who feel alone. 
We need collaboration in academia to tackle 
these problems. Black scientists can’t carry 
the burden on their own. 

Universities can help faculty members to 
learn how to steer difficult conversations, 
and to acknowledge the emotional burden 
of systemic racism. Department heads can 
encourage their faculty members to initiate and 
maintain these discussions with their teams. 

At the institutional level, so much of the 
diversity conversation right now is focused 
on recruiting more students of colour and 
finding ways to support them once they arrive. 
And that’s important, but it’s a very different 
question from the one that the Black commu- 
nity is struggling with right now: we're asking 
what we can do to prevent the next instance 
of police brutality. I saw that the University 
of Minnesota in Minneapolis had severed a 
contract with the city’s police department, 
which had provided patrols at large campus 
events, because of concerns about violence. 
Those are bold, immediate actions that get to 
the root of the problem. 

On social media, I saw medical students at 
the University of Washington in Seattle initiate 
an anti-racism summer reading programme. 
I really hope that tenured white faculty mem- 
bers will do the research, too, and be able to 
say: “I’ve learnt how racism, discrimination 
and implicit bias affect my Black colleagues.” 
And then they will realize at the next faculty 
meeting that they can start conversations that 
their Black colleagues cannot start without 
putting their careers on the line. 


Nikea Pittman is a structural biologist at the 
University of North Carolina, Chapel Hill. 


HENRY HENDERSON 
CREATE A WELCOMING 
ENVIRONMENT 


I trained at Tuskegee University in Alabama, 
a historically Black university, where every- 
one looked like me. But I started to question 
whether I wanted to continue in academia 
when I went to national conferences and saw 
so few people of colour. Yet I realized that, if 
I stepped out, it would decrease the chances 
for someone else who looked like me to climb 
the academic ladder. 

At Vanderbilt University in Nashville, 
Tennessee, I am one of just two Black postdocs 
on a floor with at least ten labs. It reminds me 
there’s still a lot to do. We have programmes to 
support diversity and inclusion, but that’s not 
enough. I’ve experienced microaggressions. 
At a poster presentation at a large meeting, 
every question I got asked was, “Who did all this 
work?”, even though I was first author. It’s not 
unusual for white colleagues to insinuate that 
people of colour are in certain programmes 
only because their research is being supported 
by supplementary grants for members of 
minority groups. If you create opportunities, 
honour the policy in place to increase diversity. 

When I come to the lab in the evening, I am 
usually followed by campus police from the 
parking complex to the building, and they ask 
to see my badge but let white people go right 
by. I’m used to that from city police. My mother 
wants me to wear my ID badge everywhere I go, 
in the hope that it will keep me safe. 

I had considered leaving academia several 
times. My current principal investigator, 
oncologist Christine Lovly, saved me. She 
strives to foster opportunities, and puts 
me in leadership roles — inviting me, for 
example, into a group focused on health 
disparities in cancer — so that I can have a 
voice and gain experience promoting dia- 
logue in areas relevant to my research. This 
approach took me by surprise at first. I wondered 
whether she was overworking me, but 
she wanted to increase my impact in spaces 
that I otherwise wouldn't be in. She’s encour- 
aged me to apply for a US National Institutes 
of Health grant, even though I was hesitant 
because of the gap between success rates 
for white and African American applicants. 
I wouldn't be on Twitter if it weren’t for her. 
She saw that I wanted to diversify science, 
and created this postdoc opportunity so that 
I could reach back and pull others up, too. 

Creating a welcoming environment is very 
different from inviting Black people to uni- 
versity functions, academic programmes and 
professional society meetings simply to lift our 
numbers. We can tell when we're here to be 
here as opposed to being tokenized. In a wel- 
coming space, we are asked to speak at talks, to 
offer input, to collaborate and to lead projects. 

I worry that there are periods during 
which society is outraged by racism, only for 
everything to go back to the way it was within 
a couple of months. We need a continuous 
effort to improve racial diversity in science and 
medicine. Institutions should create more pro- 
grammes that emphasize not only diversity, 
but also the retention of Black students in the 
scientific pipeline, including support throughout 
their education. I have six nephews who 
are all interested in science because of me. 
Imagine if there were more Black academics. 
How many other kids would be inspired? 


Henry Henderson is a cancer biologist 
at Vanderbilt University Medical Center, 
Nashville, Tennessee. 


ABDULHAKIM ABDI 
MAKE HIRING FOR LEADERSHIP 
POSTS MORE TRANSPARENT 


I came to Sweden in 2012 to get a PhD at 
Lund. Before that, I worked as a geographic- 
information systems specialist at the 
Lamont-Doherty Earth Observatory in 
New York City. There, Robin Bell, head of the 
research group and current president of 
the American Geophysical Union, created an 
environment where everybody’s voice was 
heard and respected. That encouraged me 
to pursue a PhD and to ask my own research 
questions. 

Physical geographer Abdulhakim Abdi calls for a sustained effort to improve racial diversity. 

At the same time, I experienced overt 
racism in the United States while exploring 
the outdoors as a birdwatcher. I grew up in 
Abu Dhabi and started observing birds when 
I was 14. After I moved to the United States at 
the age of 21, for my undergraduate degree, 
I had to explain to police at least half a dozen 
times that I was simply outside watching birds. 
The police weren't aggressive, but they told 
me that I was making people uncomfortable 
and that I had to move along. So, for the rest 
of the 11 years that I was in the United States, 
I stopped birdwatching. Here in Sweden, 
people might give me strange looks when 
I’m outdoors, but they don’t say anything or 
call the police. In everyday encounters here, 
it can be difficult to discern whether people 
are being racist or whether they are simply 
introverted culturally. 

The student body is pretty diverse in 
universities across Sweden, but less so in the 
upper echelons. I’m worried about the lack of 
transparency in hiring and promotion deci- 
sions. What goes on behind closed doors? 
These problems were detailed in a 2018 report 
on the prevalence of nepotism in Swedish 
academia (see go.nature.com/3ip75Yh). 

As the report suggested, a combination 
of increased funding, to help alleviate uni- 
versities’ reliance on hiring ‘safe bets’, and 
greater transparency in hiring should reduce 
nepotism and hopefully eliminate it. Swedish 
institutions have strong gender-equality pro- 
grammes, as they should. But there are no 
programmes that open the higher echelons 
of academia to qualified and capable research- 
ers from the country’s recognized minority 
groups; these include the Sami and the Roma, 
as well as other established communities such 
as people from Iraq and Somalia. 

Still, when I was offered a position in the 
United States soon after I earned my PhD, I 
decided to stay in Sweden with my family — it 
made more sense financially because we have 
subsidized day care and health care. 

I think every university should create an 
ecosystem where Black scientists and those 
from other minority groups feel comfortable. 
Academics need to let go of the myth that 
they have no biases. They need to respect 
Black people’s opinions and invite Black and 
minority-ethnic academics to be co-authors 
of papers. And they need to expand their net- 
works to create opportunities for colleagues 
from under-represented groups. 


Abdulhakim Abdi is a physical geographer at 
Lund University, Sweden. 


Interviews by Virginia Gewin. Interviews have 
been edited for clarity and length. 
They call me a mechatronics engineer, 
a hybrid between a roboticist 
and a mechanical engineer. In the 
Spacecraft Assembly Facility at the 
Jet Propulsion Laboratory (JPL) 
in Pasadena, California, we build and test 
spacecraft and rovers, the vehicles that can 
traverse the surface of a destination. 

In February, we finished the main assembly 
of the Mars 2020 rover, Perseverance — due 
to launch this month or in August — whose 
mission is to seek signs of ancient life. What’s 
exciting about this rover is that it can take 
samples of Martian soil, analyse them and 
seal them in containers for a future mission 
to collect and bring back to Earth. The rover’s 
complex machinery can drill down to extract 
core samples 4-5 centimetres long, then 
bring them inside the rover to process and 
photograph. 

On any mission, we try to ensure that no 
volatile chemicals, plastics or paints are 
deposited on sensitive surfaces such as 
optical lenses. We also take care to avoid 
contaminating the Mars surface with Earth 
organisms such as bacteria or spores, which 
could interfere with getting accurate results. 
We work in the giant ‘clean room’ seen here. 
Before entering, we put on full protective 
gear: little has changed at the JPL in that 
respect since the COVID-19 lockdown began. 

With Mars 2020, we have to be even 
more careful because some of the samples 
collected will come back. When we're 
working with equipment that will process 
soil cores, we wear an extra set of single-use, 
sterile gloves, and a sterile lab gown and 
goggles. A buddy hands me tools, anything I 
need, so that I don’t contaminate my gloves. 

Since February, I’ve been based at the 
Kennedy Space Center in Cape Canaveral, 
Florida, helping to put the finishing touches 
to the rover. 

As I ungown for the day, it’s nice to take a 
step back and think how we've sent rovers 
to other planets only a handful of times. It’s 
pretty monumental. 


Zach Ousnameer is the integration and test 
engineer for NASA’s Mars 2020 mission at 
the Jet Propulsion Laboratory in Pasadena, 
California. Interview by Amber Dance. 


